Sir:

I, Monika Wood, M.S., declare and say as follows: 

1. I am one of the named co-inventors of the claims in the above-identified 
application. I make this Declaration in support of the patentability of the claims of the 
above-identified application. 



2. Sherf et al. (U.S. Patent No. 5,670,356) disclose that a firefly luciferase (luc)
gene was modified using mammalian codon replacement to remove 3 internal
palindromic sequences, 5 restriction endonuclease sites, 4 glycosylation sites, and 6
transcription factor binding sites, yielding luc+. On June 14, 2006, using publicly
available software and a database of transcription factor binding sites (see attached details
on the specific software, search parameters, and database release used), comparable to
those employed in the above-referenced application, potential mammalian transcription
factor binding sites were identified in the luc+ gene. I found that the luc+ gene contains
over 150 potential mammalian transcription factor binding sites.

3. Thus, mammalian transcription factor binding sites in a particular nucleic acid
sequence can be identified and enumerated.

4. I further declare that all statements made herein of my own knowledge are true, 
and that all statements made on information and belief are believed to be true; and further 
that these statements were made with the knowledge that willful false statements and the 
like so made are punishable by fine or imprisonment, or both, under Section 1001 of Title
validity of the application or any patent issued thereon.
TESS - Filtered String Search Page

Database Versions: TRANSFAC=4.0, IMD=v1.1, CBIL/GibbsMat=v1.1

The TRANSFAC database is free for non-commercial use. For commercial use the TRANSFAC databases and programs have to be licensed. Please read the DISCLAIMER!

New Feature: The tabular display page of the site search now displays the ordinal of the sorted hits. 
The site is basically working. Please report any errors you encounter. 
To keep the load on our server to a reasonable level, we have implemented a cap on the number of 
jobs that are waiting to execute. When submitting a search job, you may see a message asking you 
to submit your job later. In that case, wait a few minutes and try again. 



What potential transcription factor binding sites are there in my sequence? 



Input 

Enter the minimal information needed to submit a job to TESS. 



Title: luc+ (text)



You may submit multiple sequences in this window. Each sequence can have a maximum
length of 2000[bp]. The sequences must have a total length less than 100000[bp].



DNA Sequence(s):

atggaagacgccaaaaacataaagaaaggcccggcgccattctatccgctggaagatgga
accgctggagagcaactgcataaggctatgaagagatacgccctggttcctggaacaatt
gcttttacagatgcacatatcgaggtggacatcacttacgctgagtacttcgaaatgtcc
gttcggttggcagaagctatgaaacgatatgggctgaatacaaatcacagaatcgtcgta
tgcagtgaaaactctcttcaattctttatgccggtgttgggcgcgttatttatcggagtt
gcagttgcgcccgcgaacgacatttataatgaacgtgaattgctcaacagtatgggcatt
tcgcagcctaccgtggtgttcgtttccaaaaaggggttgcaaaaaattttgaacgtgcaa
aaaaagctcccaatcatccaaaaaattattatcatggattctaaaacggattaccaggga
tttcagtcgatgtacacgttcgtcacatctcatctacctcccggttttaatgaatacgat
tttgtgccagagtccttcgatagggacaagacaattgcactgatcatgaactcctctgga

(text)



Length of time to store results of the job: day

Your email address: (string)



End of Minimal Parameters 

You can click 'Submit' to submit the job or scroll down to change the basic search parameters. 

Submit 



Databases 

Check off the databases you want to include in the search and enter your own search strings 
and/or weight matrices. 
String Databases

Search TRANSFAC Strings

Search My Site Strings: (text)

Factor Filters 

Use this section to control which factors are included in the search. 

The buttons control the number of terms used in the filter. When you click on a button this form 
will reload with the adjusted number of terms.

Fewer | More

Factor Attribute 1: Organism Classification (text)

matches: Mammalia

Score Filters 

Adjust these parameters to control the required strength of the match between the site and the 
string or weight matrix model. 

String Scoring 

Use these parameters if you have chosen to search for string matches. 



Use only core positions for TRANSFAC strings 
Maximum Allowable String Mismatch % (t^^) 

Minimum log-likelihood ratio score 

Minimum string length (t^^) 

Output Control 

Secondary Lg-Likelihood Deficit: fli 



□ 

KB 

10 (real number >=0)

KB

(real number >=0.0 and <=6.0)

Count significance threshold: 1.0e-2 (real number >=0.0 and <=1.0)

Click Submit or scroll down to adjust the expert search parameters. 
Submit 



Help

Title:

This is a short title for your sequence which will appear in the results. Use it as you see fit to
identify the sequence.

The parameter's type is text. An item of text comprises an arbitrary sequence of characters, 
possibly including white-space and newlines. 

DNA Sequence(s):

Enter your nucleic acid sequence(s) here using the IUB standard. TESS ignores the case of the
letters and the presence of digits and white space. That means you can cut the sequence
section from a GenBank entry and paste in here without any editing.

You may submit multiple sequences in this window. They will be processed individually. For
example:

>Seq1
acgtagtagagctaga
>Seq2
acgtagcatgactgggatatatatatat

The parameter's type is text. An item of text comprises an arbitrary sequence of characters, 
possibly including white-space and newlines. 
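
The input handling described above (case, digits, and white space ignored; several titled sequences processed individually) can be sketched as follows. This is an illustrative sketch only, not TESS code; the function name `parse_sequences` and the treatment of text before the first `>` line as a single untitled sequence are assumptions.

```python
def parse_sequences(text):
    """Split pasted text into (title, sequence) pairs.

    Assumption: each record starts with a '>' title line, as in the
    example above; text before the first '>' is one untitled sequence.
    """
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new title closes the record collected so far, if any.
            if title is not None or chunks:
                records.append((title or "", "".join(chunks)))
            title, chunks = line[1:].strip(), []
        else:
            # Fold case and drop digits and white space.
            cleaned = "".join(ch for ch in line.lower() if ch.isalpha())
            if cleaned:
                chunks.append(cleaned)
    if title is not None or chunks:
        records.append((title or "", "".join(chunks)))
    return records
```

Under these assumptions, a block cut from a GenBank entry (position numbers, spaces, mixed case) reduces to one lower-case sequence string.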
Length of time to store results of the job:

Select the length of time you want us to store the results of this job. We'll try to keep it this long. 

The parameter's type is text. An item of text comprises an arbitrary sequence of characters, 
possibly including white-space and newlines. 

Your email address:

Enter the email address to which you want material sent. 

The parameter's type is string. A string is a single group of non-white space characters, e.g. 
'this-is-a-string' but not 'this is not'. 

Search TRANSFAC Strings:

Select this option to search for matches to TRANSFAC sites in your query sequence.

Search My Site Strings:

Select this option and enter site strings if you want to search for your own site strings in the query
sequence(s).

Strings should be placed one per line with a trailing name separated from the sequence by one
or more spaces or tabs. They will be assigned accession numbers from the series U00001,
U00002, etc.

Here is a sample: 

agtctgannnnagtca factor x 
aggtggaa hairy eyeball 

The parameter's type is text. An item of text comprises an arbitrary sequence of characters, 
possibly including white-space and newlines. 

Factor Attribute 1:

Choose an attribute of the FACTOR database which you want to use to select factors to search 
for in your sequence. 

See also: rxp 

The parameter's type is text. An item of text comprises an arbitrary sequence of characters, 
possibly including white-space and newlines. 

matches:

Enter the pattern that must be found in a factor's attribute to be included in the search. You can
follow these links to see what possible values each field takes on. This may help in forming
queries.

Organism Species
Organism Classification
Name
Synonyms
Interacting Factors
Class
Cell Positive Specificity
Cell Negative Specificity
Id

Use the "External Database References" option to search for factors by their EMBL, SwissProt,
Flybase, Compel, or PIR accession numbers or ids. For example to find 'EMBL: J03236;
MMJUNBA' you can enter either 'J03236' or 'MMJUNBA'.

See also: att 

The parameter's type is text. An item of text comprises an arbitrary sequence of characters, 
possibly including white-space and newlines. 

Use only core positions for TRANSFAC strings:

Site strings in TRANSFAC indicate which positions are important to binding but also include 
unimportant positions as well. 

If you select this option, then the unimportant positions will be removed from the site string prior 
to searching. 

The parameter's type is Boolean. Either True/False, Yes/No, 0/1, or On/Off. 

Maximum Allowable String Mismatch % (t^,^):

TESS will consider a transcription element to match a part of your sequence as long as the
percentage of mismatch is below the specified level.

The number you enter here is an integer percent value and so must be between 0 and 100.

The parameter's type is integer. A series of digits optionally preceded by a minus sign.
Commas are ok.
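
As a rough illustration of a percent-mismatch filter of this kind, the sketch below slides a site string along a sequence and keeps the windows whose mismatch percentage does not exceed a cap. It is not TESS code: the function names are hypothetical, ambiguity codes are not handled, and treating the cap as inclusive is an assumption (the help text says only "below the specified level").

```python
def mismatch_percent(site, window):
    """Percentage of positions at which two equal-length strings differ."""
    if len(site) != len(window):
        raise ValueError("site and window must have the same length")
    mismatches = sum(1 for a, b in zip(site.lower(), window.lower()) if a != b)
    return 100.0 * mismatches / len(site)


def find_sites(site, sequence, max_mismatch_pct=10):
    """Return 0-based start positions whose mismatch percentage is within the cap.

    Assumption: the cap is inclusive, and only unambiguous bases are compared.
    """
    n = len(site)
    return [
        i
        for i in range(len(sequence) - n + 1)
        if mismatch_percent(site, sequence[i:i + n]) <= max_mismatch_pct
    ]
```

With a cap of 10%, a 10-base site tolerates one mismatched position, whereas an 8-base site must match exactly.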

Minimum log-likelihood ratio score (tg,^):

TESS will not report string matches with a log-likelihood ratio less than this value. Use this 
value as a hedge against matches against sites with many ambiguous characters. Such sites 
would score well in terms of percentage mismatch, but have a poor log-likelihood ratio. 

The log-likelihood ratio for strings is computed roughly as follows. Each match against an
unambiguous base is worth an LLR of 2. A match against an ambiguity code that represents two
bases, e.g., S = C or G, is worth 1. A match against an ambiguity code that represents three
bases, e.g., D = A, G, or T, is worth about 0.75. A match against an 'N' is worth 0.
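
The per-position values quoted above can be tabulated directly. The sketch below sums them over a site string to give the score a perfect match to that string would earn; it uses the rounded values from the help text (2, 1, about 0.75, 0) rather than any exact formula, and the names `DEGENERACY`, `SCORE_BY_DEGENERACY`, and `string_llr` are hypothetical. How TESS scores mismatched positions is not stated here, so mismatches are not modelled.

```python
# Number of bases represented by each IUB code.
DEGENERACY = {
    "a": 1, "c": 1, "g": 1, "t": 1,
    "r": 2, "y": 2, "s": 2, "w": 2, "k": 2, "m": 2,
    "b": 3, "d": 3, "h": 3, "v": 3,
    "n": 4,
}

# Per-position scores as quoted in the help text (0.75 is "about").
SCORE_BY_DEGENERACY = {1: 2.0, 2: 1.0, 3: 0.75, 4: 0.0}


def string_llr(site):
    """Approximate LLR earned by a perfect match to the given site string."""
    return sum(SCORE_BY_DEGENERACY[DEGENERACY[ch]] for ch in site.lower())
```

For the sample site string `agtctgannnnagtca`, the twelve unambiguous positions contribute 2 each and the four `n` positions contribute nothing, so a perfect match scores 24 under this approximation.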

The parameter's value is constrained to be >=0.

The parameter's type is real number. The standard decimal format plus scientific notation (eg, 
2.1e-32.1). The decimal point is optional. An empty string defaults to '0.0'. Use commas or 
white space to make the number more legible. 

Minimum string length (t^):

TESS will not search for sites that are shorter than this length.

Set this value to higher values to avoid getting swamped with weak hits.

The parameter's type is integer. A series of digits optionally preceded by a minus sign.
Commas are ok.

Secondary Lg-Likelihood Deficit:

This threshold is used to highlight alignments that are especially good. Alignments that fail to 
meet this threshold are reported but are indicated in the sequence display by magenta or cyan 
rather than red or blue. 

See also: mi l d 



The parameter's value is constrained to be >=0.0 and <=6.0. 

The parameter's type is real number. The standard decimal format plus scientific notation (eg, 
2.1e-32.1). The decimal point is optional. An empty string defaults to '0.0'. Use commas or
white space to make the number more legible. 

Count significance threshold:

This threshold is used to remove those matrices that do not produce a significantly high number
of hits in the sequence. The total number of hits is tallied for each PWM. The number of hits is
approximated by a Poisson distribution with a rate estimated from empirical data measured on 
a random sequence generated from a uniform distribution using the log-likelihood or similarity 
threshold. A p-value (the probability of getting the same or more hits) is computed for each 
matrix. If you select this option then those PWMs with a p-value greater (more likely) than the 
threshold are eliminated. 
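
The p-value described above is an upper-tail Poisson probability. A minimal sketch follows, assuming the expected hit count (the Poisson rate for the sequence being searched) has already been estimated; the function names and the default threshold of 1.0e-2 are illustrative assumptions, not TESS internals.

```python
import math


def poisson_tail(observed_hits, expected_hits):
    """P(X >= observed_hits) for X ~ Poisson(expected_hits)."""
    if observed_hits <= 0:
        return 1.0
    # Sum the probabilities of 0 .. observed_hits - 1 and take the complement.
    cdf = sum(
        math.exp(-expected_hits) * expected_hits ** i / math.factorial(i)
        for i in range(observed_hits)
    )
    return max(0.0, 1.0 - cdf)


def keep_matrix(observed_hits, expected_hits, threshold=1.0e-2):
    """Keep a PWM only if its hit count is unlikely to arise by chance."""
    return poisson_tail(observed_hits, expected_hits) <= threshold
```

For example, one hit where one is expected gives a p-value of about 0.63 and the matrix would be dropped, whereas ten hits where one is expected would be kept.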

The parameter's value is constrained to be >=0.0 and <=1.0. 

The parameter's type is real number. The standard decimal format plus scientific notation (eg, 
2.1e-32.1). The decimal point is optional. An empty string defaults to '0.0'. Use commas or 
white space to make the number more legible. 
INVITED EDITORIAL

Genomic Sequence, Splicing, and Gene Annotation 

Stephen M. Mount 

Department of Cell Biology and Molecular Genetics, University of Maryland, College Park 



Introduction 

The sequence of the human genome is at hand. Most 
scientists who use the sequence will rely on annotations 
that provide information about the number and loca- 
tion of genes and about their inferred protein products. 
Traditionally, genes have been annotated by scientists 
with a particular interest in them. However, annota- 
tion of the complete human genome sequence will have 
to be at least partially automated. Gene annotation in- 
corporates cDNA data (including expressed sequence 
tags [ESTs]), sequence similarity, and computational pre-
dictions based on the recognition of probable splice
sites and coding regions (Stormo 2000; also see David
Haussler's Web site, Computational Genefinding). The
state of the art was recently surveyed by the Genome
Annotation Assessment Project-GASP1 and must be re-
garded as imperfect (Bork 2000; Reese et al. 2000).

This review enumerates aspects of pre-mRNA splicing 
that limit our ability to predict gene structure from ge- 
nomic sequence, drawing on the recently annotated 
complete genome of Drosophila melanogaster (Adams 
et al. 2000) as an example. In particular, the following 
four facts will be discussed. First, splice sites do not 
always conform to consensus. Second, noncoding exons
are common. Third, internal exons can be arbitrarily 
small, and small internal exons confound not only gene 
finding but also the alignment of cDNA and genomic 
sequences. Fourth, splice sites are not recognized in iso- 
lation, and nucleotides that are far from splice sites can 
affect splicing. This list and the accompanying analysis 
should make molecular geneticists aware of the ways 
in which gene annotations can be wrong and should 
encourage recourse to the primary data. In addition, the 
same considerations indicate that inherited disease can 
be caused by mutations remote from splice sites that
nevertheless affect splicing. 

Discussion 

Splice Sites Do Not Always Conform to Consensus 

It is well established that nearly all splice sites conform 
to consensus sequences (Mount 1982; Senapathy et al. 
1990; Zhang 1998). These consensus sequences include 
nearly invariant dinucleotides at each end of the in-
tron: GT at the 5' end of the intron and AG at the 3'
end of the intron. Most gene-finding software and most 
human annotators will find only introns that begin with 
a GT and end with an AG. However, nonconsensus 
splice sites have been described, and I will discuss three 
classes, in decreasing order of frequency. 

The most common class of nonconsensus splice sites
consists of 5' splice sites with a GC dinucleotide. Sen-
apathy et al. (1990) listed 17 examples among 3,724 5'
splice sites, suggesting a frequency of ~0.5%. Jackson
(1991) listed a total of 26 GC sites, whereas Wu and
Krainer (1999) cited an additional 18 examples. GC 5'
splice sites are consistent with the experimental obser-
vation that, of the six possible point mutations within
the GT dinucleotide, mutation of T to C in position 2
has the smallest effect on in vitro splicing (Aebi et al.
1986). At other positions within the consensus, GC sites
conform extremely well to the standard consensus; for
example, 42 of the 44 sites cited above have a consensus
G residue at both position -1 and position +5. It is
reasonable to assume that GC sites are recognized by
the standard (U2-dependent) spliceosome.
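
The terminal-dinucleotide test applied by most gene finders, and the frequency estimate quoted from Senapathy et al. (1990), can be illustrated with a short sketch. The function name `classify_intron_ends` is hypothetical, and the input is assumed to be the intron sequence alone, written 5' to 3'.

```python
def classify_intron_ends(intron):
    """Label an intron by its first and last dinucleotides."""
    s = intron.upper()
    ends = (s[:2], s[-2:])
    if ends == ("GT", "AG"):
        return "GT-AG"   # standard consensus
    if ends == ("GC", "AG"):
        return "GC-AG"   # most common nonconsensus class
    return "other"


# Frequency of GC 5' splice sites implied by 17 examples among 3,724 sites.
gc_fraction = 17 / 3724
gc_percent = round(100 * gc_fraction, 2)
```

Here `gc_percent` evaluates to 0.46, consistent with the figure of roughly 0.5% given above. A finder that accepts only the "GT-AG" label would miss every intron in the other two categories.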

The second class of exception to splice-site consensus 
is U12 introns, a minor class of rare introns with splice- 
site sequences that are very different from the standard 
consensus but that are very similar to each other. The 
existence of this class was first pointed out by Jackson 
(1991) and was considered in more detail by Hall and 
Padgett (1994). It was subsequently discovered that U12
introns are removed by a minor spliceosome containing
the rare U11, U12, U4atac, and U6atac snRNPs, in place
of U1, U2, U4, and U6 (Tarn and Steitz 1997; Burge et
al. 1998). Some U12 introns have AT and AC in place
of GT and AG and are known as "AT-AC introns."
However, terminal intron dinucleotide sequences do not
distinguish between U2- and U12-dependent introns
(Dietrich et al. 1997). Rather, U12 introns can be iden-
tified by highly conserved sequences at the 5' splice site
(RTATCCTY; R = A or G, and Y = C or T) and branch
site (TCCTRAY). U12 introns are found in many eu-
karyotes, including Drosophila melanogaster (Adams et
al. 2000) and Arabidopsis thaliana (Shukla and Padgett
1999) but not Caenorhabditis elegans.
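
The two conserved U12 motifs given above translate directly into regular expressions. The sketch below is a naive screen, not a validated classifier: it assumes the 5' motif begins at the first nucleotide of the intron and accepts a branch-site match anywhere downstream of it, without checking its distance from the 3' splice site. All names are hypothetical.

```python
import re

# Only the two ambiguity codes used in the motifs above are needed.
IUPAC = {"R": "[AG]", "Y": "[CT]"}


def iupac_to_regex(motif):
    """Expand R and Y into character classes; other letters are literal."""
    return "".join(IUPAC.get(ch, ch) for ch in motif.upper())


FIVE_PRIME = re.compile(iupac_to_regex("RTATCCTY"))
BRANCH = re.compile(iupac_to_regex("TCCTRAY"))


def looks_like_u12(intron):
    """True if the intron starts with the 5' motif and has a branch motif after it."""
    s = intron.upper()
    # Search for the branch motif only past the 8-nucleotide 5' motif.
    return FIVE_PRIME.match(s) is not None and BRANCH.search(s, 8) is not None
```

Note that the test never inspects the terminal dinucleotides, in keeping with the observation that they do not distinguish U2- from U12-dependent introns.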

Finally, there are a small number of nonconsensus sites 
that fit into neither of the two categories mentioned 
above. Many reports of such variant splice sites can be 
traced to errors in annotation or interpretation, poly- 
morphic differences between the sources of cDNA and 
genomic sequence, inclusion of pseudogene sequences, 
or failure to account for somatic mutation (author's un- 
published data; for examples, see Jackson 1991). How- 
ever, there are many examples of sites that match the 
consensus very poorly, and experimental work has es-
tablished that 5' splice sites do not absolutely require
GT, and that 3' splice sites do not absolutely require
AG, in order to be recognized in vivo (Aebi et al. 1986;
Roller et al. 2000, and references therein). In yeast, an
intron that is within the HAC1 mRNA and that has no
similarity to the standard nuclear pre-mRNA intron con-
sensus sequence is spliced by a specific, regulated en-
donuclease and tRNA ligase (Sidrauski et al. 1996). This
intron provides a precedent for introns in protein-coding 
genes with completely novel splice sites. 

Noncoding Exons Are Common 

There is considerable confusion between exons and 
coding regions. The term "exon" was coined by Gilbert 
(1978) to refer to what is left when introns are removed 
by splicing, and RNAs that are entirely noncoding (such
as tRNAs) are sometimes spliced. However, the term 
exon is often misused to refer to a stretch of coding 
information. In reality, however, noncoding exons are 
quite common, occurring in >35% of human genes 
(Zhang 1998). Gene-finding software generally detects 
sequence features characteristic of coding regions rather 
than of exons and does not even attempt to identify 
noncoding exons, or noncoding portions of exons. This 
is because the statistical biases introduced by protein-
coding are in fact a very powerful tool for the identifi-
cation of coding DNA, and no similar tool has been
developed for the identification of noncoding exons. 

A similar problem can arise in genes without non- 
coding exons. If the first intron occurs near the initiator 
AUG, then the coding information in the first exon can 
be very short and difficult to identify by measures of 
coding tendency. Furthermore, the first intron tends to
be longer than average (Maroni 1996), and such an ar-
rangement can separate promoter function (perhaps in-
cluding downstream transcriptional enhancer elements
lying in the first intron) from the bulk of the coding
information downstream. In these cases, investigators
have no way of knowing how much information is miss-
ing, or where the 5' end of the gene is likely to re-
side, without experimental data such as a cDNA se-
quence or a 5' EST.

Internal Exons Can Be Arbitrarily Small

A less frequent but perhaps more serious problem for 
gene-discovery methods is posed by small internal exons. 
Vertebrate internal exons have an average size of ~130
nucleotides (Hawkins 1988; Zhang 1998), and roughly 
65% of internal human exons are 68-208 nucleotides 
in length (Maroni 1996). This size distribution reflects 
a functional constraint. Optimal splicing efficiency re-
quires exons with sizes of ~50-300 nucleotides (Rob-
berson et al. 1990; Dominski and Kole 1991; see re-
view by Berget 1995). However, a considerable number,
>10%, of exons are <60 nucleotides in length, and it is 
these exons that can be difficult to identify by measures 
of coding tendency. 

Just how small can internal exons be? There appears 
to be no lower limit, and many cases of exons <10 nu- 
cleotides have been described (for examples, see Stamm 
et al. 1994; also see the author's Web site, Gene An-
notation and Splice Site Selection). An illustrative case
is the invected gene of D. melanogaster (also listed in 
GadFly as CG17835). This gene encodes a homeodo- 
main protein that is similar to engrailed, and these two 
genes are adjacent. One of four invected exons is only 
6 nucleotides long and is flanked by introns of 27,659 
and 1,134 nucleotides. Significantly, this exon is not rec- 
ognized by cDNA alignment software such as SIM4 (Flo- 
rea et al. 1998), and the gene is incorrectly annotated 
(GenBank accession number AE003825.1). As a result, 
the protein sequence predicted by annotation of the 
genome (Adams et al. 2000; GenBank accession num- 
ber AAF58640) differs from that predicted from the 
cDNA (Coleman et al. 1987; GenBank accession number 
CAA28885), because of a frameshift affecting the entire 
carboxyl-terminal coding exon, a highly conserved re- 
gion of the protein. This is despite the fact that the mi- 
croexon sequence, GTCGAA, is flanked by intron se- 
quences that perfectly match the splice-site consensus. 
Use of this microexon provides perfect agreement be- 
tween the cDNA and genomic sequences when consen- 
sus splice sites are used, whereas the annotation predicts 
an RNA with several discrepancies relative to the cDNA. 
The frameshift is due to the predicted use of a 5' splice 
site 10 nucleotides downstream of the true 5' splice site, 
which was apparently selected to account for the mi-
croexon. It seems clear that the protein sequence pre-
dicted by the cDNA is correct. Why was it not incor-
porated into the annotation? The alignment problem
arises because a pattern-matching algorithm that locates
exons by similarity between the cDNA and the genomic
sequence cannot find exons of this size unless its strin-
gency is reduced to an unacceptable level (Florea et al.
1998).
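
A back-of-the-envelope calculation shows why similarity alone cannot place a 6-nucleotide exon. The sketch below estimates how many exact matches to a given 6-mer are expected by chance in the region spanned by the two flanking introns and the microexon. It assumes independent, equally frequent bases and a single strand, which real genomic sequence does not satisfy, so the result is only an order-of-magnitude illustration; the function name is hypothetical.

```python
def expected_chance_matches(region_length, k):
    """Expected exact matches to one specific k-mer in random sequence.

    Assumption: bases are independent and equally frequent (p = 1/4 each).
    """
    windows = max(region_length - k + 1, 0)
    return windows / 4 ** k


# Flanking introns of 27,659 and 1,134 nucleotides plus the 6-nucleotide exon.
region = 27_659 + 6 + 1_134
expected = expected_chance_matches(region, 6)
```

Under these assumptions `expected` is about 7, so a sequence such as GTCGAA would be expected to occur several times in the interval by chance, and an aligner must rely on flanking splice-site signals rather than similarity to choose among them.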

The notion that exons can be arbitrarily small is sup- 
ported by the observation of exons with length 0. Of 
course, such sites are not exons at all but, rather, are 
resplicing sites (see fig. 1). This phenomenon has been 
demonstrated in the case of the Drosophila Ultrabi- 
thorax locus (Hatton et al. 1998), which has a region 
of 60 kb containing two alternatively spliced exons, and 
may be a general feature of long introns (J. Burnette and 
A. J. Lopez, personal communication). The existence of 
resplicing sites not only illustrates the lack of a lower 
limit to exon size (which has implications for gene an- 
notation) but also has implications for the analysis of 
hereditary mutations. A mutation at one of these sites 
could potentially create a frozen intermediate such as 
that diagrammed in figure 1. This partially spliced RNA 
would probably be unstable, because of nonsense-me- 
diated decay (Culbertson 1999), and the apparent result 
would be no RNA (rather than aberrantly spliced RNA). 
Such mutations would be very hard to identify. 

Nucleotides Far from Splice Sites Can Affect Splicing 

No method of evaluating potential splice sites that is 
based on sequence alone can be 100% reliable. One can 
be sure of this because many sequences that are not splice 
sites are capable of acting as splice sites, and vice versa. 
Perhaps the clearest demonstration of this is provided 
by the activation of cryptic splice sites. These are splice 
sites that are used, sometimes with 100% efficiency, 
when a natural splice site has been mutationally inac-
tivated. The activation of cryptic sites occurs in ap-
proximately one-third of splicing mutations (Nakai and
Sakamoto 1994). The phenomenon shows that the cryp-
tic sites are perfectly capable of being recognized by the
splicing machinery. Clearly, the sequence of such cryptic 
sites is compatible with splicing, and context is impor- 
tant for splice-site choice. 

Two contextual elements that contribute to splice-
site selection are the location of splice sites relative to
each other and splicing-enhancer sequences. The exon-
size preferences described above are widely understood
in terms of an exon-definition model that includes the
interaction of splicing factors bound at either end of an
exon (Berget 1995). The requirement for productive in-
teractions among splicing factors, including U1 snRNPs
at the 5' splice site and U2 snRNP auxiliary factor
(U2AF) at the 3' splice site, is thought to give rise to
preferred exon lengths because of steric constraints and
geometry favoring interactions. In the case of small in-
trons, a similar model of intron bridging has been pro-
posed (Guo and Mount 1995; McCullough and Berget
1997). In combination, these models suggest that, in
order to be recognized, a splice site must have a partner
an appropriate distance away, so that either exon defi-
nition or intron definition is facilitated by the spacing.
One experimental distinction between exon definition
and intron definition is the result of mutations that in-
activate the splice site. Failure to undergo exon definition
results in exon skipping, whereas failure to undergo in-
tron definition results in intron retention.

Figure 1 Small internal exons and resplicing. This schematic
figure indicates the pathway of resplicing demonstrated for the Dro-
sophila Ubx locus (Hatton et al. 1998). The thicker vertical line in-
dicates a resplicing site, which does not contribute any nucleotides to
the final mRNA product. The same pathway could be followed in the
case of a microexon, in which case an arbitrarily small number of
nucleotides would remain in the mRNA product. "Up. Exon" and
"Down. Exon" denote the exons upstream and downstream of the
resplicing site, respectively. In the case of Ubx, the sequence imme-
diately downstream of the resplicing site is an alternatively spliced
exon (here designated "Alt. Exon"), but resplicing sites are not always
accompanied by such alternatively spliced exons (J. Burnette and A.
J. Lopez, personal communication).

Not only is the use of one splice site dependent on the 
presence of its partner across the exon, but weakness in 
one partner can be compensated by strength in the other, 
as seen with second-site revertants of splice-site muta-
tions that cause exon skipping. In an analysis of splicing
mutations at the dihydrofolate reductase locus, Caroth-
ers et al. (1993) found that a mutation at the 5' splice
site of exon 5 (G to C in the third position of the intron)
could be partially reversed by mutations that increased
the strength of the 3' splice site upstream of the same
exon (AAAG| to TTAG|, ACAG|, or ATAG|). Al-
though reversion was not complete, these data provide
a strong argument that whether a sequence functions as 
a splice site depends not only on its intrinsic strength 
but also on its context. Similarly, there are mutations 
that create splice sites within introns, activating cryptic 
exons by recruitment of appropriately placed partners 
(e.g., see Bagnall et al. 1999). 

Splicing enhancers are sequences that stimulate
splicing at nearby sites. A family of non-snRNP splic-
ing factors known as "SR proteins" appears to be im-
portant for the recognition of splicing enhancers in
exons (Blencowe 2000). A splicing difference between
SMN1 and SMN2, which explains their differential
effects on spinal muscular atrophy, has been attrib-
uted to a translationally silent substitution within the
coding sequence that affects splicing (Lorson et al.
1999). Similarly, H.-X. Liu, L. Cartegni, Q.
Zhang, and A. R. Krainer (personal communication)
have shown that a nonsense mutation causing the
skipping of BRCA1 exon 18 affects splicing in vitro
and that a missense mutation at the same position can
also cause exon skipping. There are also splicing-en-
hancer sequences in introns, and examples of mu-
tations that affect them (Cogan et al. 1997). Although
general mechanisms for their function have yet to be
defined, there is some evidence that at least some splic-
ing enhancers in introns may act by facilitating exon
definition in the case of small exons (Carlo et al.
2000).

Outlook 

This review has presented aspects of pre-mRNA splicing 
that pose special problems for gene annotation. How- 
ever, even though the best gene finders predict genes 
exactly right less than half the time, 95% of total coding 
nucleotides are predicted accurately, and <5% of genes
are completely missed (Reese et al. 2000; Genome An-
notation Assessment Project-GASP1). When cDNA and
homology data are available, annotations will tend to 
be even better. Thus, one would be wrong to conclude 
from this review that the gene annotations attending the 
human genome sequence will not provide an extremely 
valuable resource. Nevertheless, molecular geneticists 
will want to have an understanding of the kinds of errors 
that are likely to occur — and to carefully review the 
available evidence for genes that matter to them. An- 
notators are likewise obligated to make the source of 
each specific aspect of their annotation an integral part 
of the annotation; for example, if part of the annotation 
is supported by an EST whereas the rest of it is based on
the prediction of a gene finder, then the limits of the 
cDNA should be indicated, and the accession number 
of the EST should be part of the annotation. 

A related but distinct point is that these same factors 
are also relevant when candidate mutations are evalu- 
ated during the analysis of hereditary disease. Mutations 
that lie within splicing enhancers, at resplicing sites, or 
at cryptic splice sites can affect splicing even when they 
lie some distance from the splice sites actually used in 
the generation of the affected mRNA. The problem is 
further compounded by alternative splicing and the in- 
terplay between splicing and polyadenylation, topics 
that are beyond the scope of the present review. 

In summary, gene annotations will be a valuable re- 
source. However, they will not substitute for expertise 
in molecular genetics. 
We constructed a library of synthetic promoters for Lactococcus lactis in which the known consensus
sequences were kept constant while the sequences of the separating spacers were randomized. The library 
consists of 38 promoters which differ in strength from 0.3 up to more than 2,000 relative units, the latter among 
the strongest promoters known for this organism. The ranking of the promoter activities was somewhat 
different when assayed in Escherichia coli, but the promoters are efficient for modulating gene expression in this 
bacterium as well. DNA sequencing revealed that the weaker promoters (which had activities below 5 relative 
units) all had changes either in the consensus sequences or in the length of the spacer between the -35 and 
-10 sequences. The promoters in which those features were conserved had activities from 5 to 2,050 U, which 
shows that by randomizing the spacers, at least a 400-fold change in activity can be obtained. Interestingly, the 
entire range of promoter activities is covered in small steps of activity increase, which makes these promoters 
very suitable for quantitative physiological studies and for fine-tuning of gene expression in industrial biore- 
actors and cell factories. 



Metabolic engineering has promising perspectives with re-
spect to improving the properties and performances of micro-
organisms used as industrial bioreactors, as cell factories, and
in food fermentations. The importance of tuning gene expres- 
sion in this context, i.e., to perform metabolic optimization 
rather than massive overexpression or gene inactivation, is now 
far more appreciated. However, the more subtle approach of 
metabolic optimization is hampered by the lack of proper 
expression systems for tuning gene expression in many micro- 
organisms. Also, the fundamental understanding of a biologi- 
cal system through metabolic control analysis (5, 10) requires 
the tuning of enzyme activities in order to calculate the so- 
called control coefficients. For some organisms, expression sys- 
tems that allow for changing gene expression for scientific 
purposes and for a limited set of experimental conditions have 
been developed. Thus, for Escherichia coli, the lac system, the 
cI-regulated lambda PR/PL, and many derivatives of these sys-
tems have been widely applied, and such systems have also
been adapted for use in other organisms (for a recent review, 
see reference 12). With respect to changing steady-state gene 
expression, these systems can sometimes be difficult to apply, 
particularly when it comes to changing gene expression on an 
industrial scale. Besides, in most food fermentation processes, 
the addition of chemicals as inducers of gene expression or the 
changing of other process parameters is not acceptable; in such 
cases, there are virtually no expression systems available for 
tuning gene expression and thus for performing accurate met- 
abolic optimization. 

Lactic acid bacteria are widely used in food fermentation, 
e.g., cheese and yoghurt production, but besides lactic acid, 
these bacteria excrete a spectrum of organic compounds. Some 
of these are desirable with respect to the development of 
texture and flavors or for bioconservation purposes, and some 
are undesirable for similar or different reasons. The lactic acid 
bacteria are therefore obvious candidates for attempts to op-
timize the pattern of formation of these compounds for specific
applications. But the experimental tools for manipulating gene 
expression are not well developed for these bacteria. An ex- 
ception is the nisin-inducible system, developed recently by de 
Ruyter et al. (2). This system appears to be well suited for 
inducing gene expression in Lactococcus lactis by adding the 
antibiotic nisin (which is accepted as a food additive). A ques- 
tion that perhaps needs to be addressed in this context is 
whether the nisin expression system is also suitable for achiev- 
ing a steady level of gene expression. In addition, for effective 
metabolic optimization, it is often necessary to optimize the 
expression of a number of genes, which is not feasible with the 
systems developed so far. 

Here we describe a method for tuning steady-state gene
expression in L. lactis. We overcome many of the limitations
discussed above by using libraries of synthetic promoters which
cover a wide range of promoter activities and show that the
strength of prokaryotic promoters can be modulated by ran-
domizing the spacer sequences that separate the consensus
sequences. The system is food grade and well suited for use in 
industrial bioreactors and food fermentation processes. In ad- 
dition, the system should be applicable to a broad range of 
biological systems. (Potential commercial users should be
aware that patent applications covering the approach for obtaining the
synthetic promoters, as well as the promoter sequences, have been filed
worldwide [7a].)

MATERIALS AND METHODS 

Bacterial strains and plasmids. The E. coli K-12 strain BOE270 (1) is highly
competent with respect to transformation and was derived from strain MT102,
which in turn is an hsdR derivative of strain MC1000 [araD139 Δ(ara-leu)7679
galU galK Δ(lac)174 rpsL thi-1 (1a)]. BOE270 was used for studying promoter
activities in E. coli as well as for cloning purposes and propagation of plasmid
DNA in E. coli. The plasmid-free L. lactis subsp. cremoris strain MG1363, which
does not express β-galactosidase activity (4), was used for studying promoter
activities in L. lactis.

The promoter cloning vector pAK80 (7) was used for cloning the synthetic
promoter DNA fragments. pAK80 is a shuttle vector for L. lactis and E. coli,
conferring erythromycin resistance to the host cells. The vector carries the
promoterless lacL and lacM genes from Leuconostoc lactis (which code for
β-galactosidase enzyme activity). It contains a multiple cloning site for the
insertion of DNA fragments harboring putative promoter signals, just upstream

FIG. 1. Strategies used for cloning synthetic promoter fragments into the promoter cloning vector pAK80. (a) Double-stranded DNA fragments carrying putative
promoter activities. (b) Restriction map and schematic representation of the relevant parts of the promoter cloning vector. The stippled and solid lines show the
strategies used for cloning pCP1 through pCP29 and pCP30 through pCP46, respectively. (c) Restriction map of clones pCP1 through pCP29. (d) Restriction map of
clones pCP30 through pCP46. Note that a number of clones have been subject to cloning artifacts and thus may have a slightly different restriction map. BI, BamHI;
AII, AflII; Ss, SspI; N, NsiI (PstI compatible); Nr, NruI; Sc, ScaI; HII, HincII; P, PstI; PII, PvuII; E, EcoRI; Sa, SacI; Xh, XhoI; BII, BglII; Sm, SmaI; Xb, XbaI (not
drawn to scale).



of the promoterless lacL and lacM genes from Leuconostoc lactis. Together, the
lacL and lacM genes code for a β-galactosidase.

Enzymes. Restriction enzymes, Klenow DNA polymerase, calf intestine phos- 
phatase, and T4 DNA ligase were obtained from and used as recommended by 
Pharmacia and New England Biolabs. 

Oligonucleotides. Oligonucleotides were obtained from Hobolth DNA Syn- 
thesis (Hillerod, Denmark). 

Second-DNA-strand synthesis. The single-stranded promoter oligonucleotides
were converted to double-stranded DNA, using a 10-bp oligonucleotide
(5'-CCGAATTCAG) complementary to the 3' end of the promoter oligonucleotide as
primer for the second-strand synthesis by the Klenow fragment of DNA poly-
merase I.
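
The relationship between the primer and the promoter oligonucleotide can be checked with a short script. The following is a minimal sketch, not part of the original protocol: it takes the oligonucleotide sequence shown in Fig. 2B and tests whether the 10-nucleotide primer is the reverse complement of its 3' end. The function and variable names are chosen here purely for illustration.

```python
# Oligonucleotide from Fig. 2B (5' -> 3'); N, R, and W are degenerate positions.
OLIGO = (
    "CGGGATCCTTAAGAATATTATGCAT"
    + "N" * 5 + "AGTTTATTCTTGACA" + "N" * 14 + "TGRTATAATANNWNA"
    + "GTACTGTTAACTGCAGCTGAATTCGG"
)
PRIMER = "CCGAATTCAG"  # primer for second-strand synthesis (5' -> 3')

# Complement table covering the IUPAC codes that occur in the oligonucleotide.
COMPLEMENT = {
    "A": "T", "T": "A", "G": "C", "C": "G",
    "N": "N", "R": "Y", "Y": "R", "W": "W",
}


def reverse_complement(seq):
    """Return the reverse complement of seq, read 5' -> 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def primer_anneals_at_3prime_end(oligo, primer):
    """True if the primer is complementary to the last len(primer) bases."""
    return reverse_complement(oligo[-len(primer):]) == primer


if __name__ == "__main__":
    print(primer_anneals_at_3prime_end(OLIGO, PRIMER))
```

Because the primer anneals only to the fixed 3' end, a single primer serves for every member of the degenerate oligonucleotide mixture.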

Cloning of synthetic DNA fragments into the promoter cloning vector pAK80.
Two different cloning strategies were used (Fig. 1). In strategy A, the mixture of
DNA fragments was digested with two restriction enzymes, HincII and SspI, and
pAK80 was digested with SmaI. In strategy B, the mixture of DNA fragments was
digested with two restriction enzymes, BamHI and PstI, and pAK80 was digested
with BglII and PstI. In both strategies, the promoter fragments were then ligated
to the compatible vector fragments. The ligation mixtures were then transformed
into Ca2+-competent cells (13) by using a standard transformation procedure
(13), and the transformation mixtures were plated (at 30°C) on LB plates con-
taining erythromycin (200 μg/ml) and 5-bromo-4-chloro-3-indolyl-β-D-galacto-
pyranoside (X-Gal; 100 μg/ml). A total of 150 erythromycin-resistant transfor-
mants were obtained; all were white initially, but after prolonged incubation (up
to 2 weeks at 4°C), a number had become blue to various extents. Later, we
discovered that the development of blue color from E. coli colonies (but not
L. lactis colonies) expressing lacLM is greatly enhanced by adding 1% glycerol to
the transformation plates (data not shown). Plasmids were isolated from these
blue colonies, and it was confirmed by restriction enzyme analysis that most of
these clones had promoter fragments inserted in the multiple cloning site of
pAK80, in the orientation that would direct transcription into the β-galactosidase
gene (lacLM). A total of 46 colonies that had become blue to various extents, 29
from cloning strategy A (containing plasmids pCP1 through pCP29) and 17 from
strategy B (containing plasmids pCP30 through pCP46), were picked for further
analysis. The two weakest promoter clones, pCP31 and pCP43, did not contain
a promoter fragment, and four promoter clones, pCP18, pCP19, pCP33, and
pCP44, turned out to be identical to pCP27, pCP22, pCP35, and pCP45, respec-
tively. Indeed, the activities of these sets were almost identical, which also
demonstrates the reproducibility of the assay used here. The chance that two
identical sequences would have arisen by coincidence during the oligonucleotide
synthesis is, of course, negligible, and these four clones must therefore be the
result of a cell division that took place after the plasmids were transformed but
before the cells were plated.
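
As an illustration of how the two cloning strategies depend on the restriction sites built into the oligonucleotide, the sketch below locates the recognition sequences in the sequence of Fig. 2B. It is a simplified example under stated assumptions: only exact matches in the fixed part of the sequence are reported, so sites that might arise by chance within the randomized (N) positions of an individual clone are not detected, and the helper name `find_sites` is hypothetical.

```python
# Oligonucleotide from Fig. 2B (5' -> 3'); N, R, and W are degenerate positions.
OLIGO = (
    "CGGGATCCTTAAGAATATTATGCAT"
    + "N" * 5 + "AGTTTATTCTTGACA" + "N" * 14 + "TGRTATAATANNWNA"
    + "GTACTGTTAACTGCAGCTGAATTCGG"
)

# Recognition sequences of the enzymes labeled in Fig. 2B.
SITES = {
    "BamHI": "GGATCC",
    "SspI": "AATATT",
    "NsiI": "ATGCAT",
    "ScaI": "AGTACT",
    "HincII": "GTTAAC",  # one instance of the degenerate site GTYRAC
    "PstI": "CTGCAG",
    "PvuII": "CAGCTG",
    "EcoRI": "GAATTC",
}


def find_sites(seq, sites):
    """Return {enzyme: [0-based start positions]} for exact matches in seq."""
    hits = {}
    for enzyme, site in sites.items():
        positions = []
        start = seq.find(site)
        while start != -1:
            positions.append(start)
            start = seq.find(site, start + 1)
        hits[enzyme] = positions
    return hits


if __name__ == "__main__":
    for enzyme, positions in find_sites(OLIGO, SITES).items():
        print(enzyme, positions)
```

In this sketch, the SspI and HincII sites used in strategy A and the BamHI and PstI sites used in strategy B each flank the promoter region, which is what allows the promoter fragment to be excised with either enzyme pair.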

Transformation of L. lactis. Cells of L. lactis subsp. cremoris MG1363 (4) were
made competent by growth overnight in GM17 medium containing 2% glycine as
described by Holo and Nes (6). Plasmid DNA from the 46 clones described
above was then transformed into these cells by electroporation (6). The cells
were allowed to regenerate in SGM17 medium for 2 h and then plated on SR
plates containing erythromycin (2 μg/ml) and X-Gal (100 μg/ml).

β-Galactosidase assay. The assay was done as described by Miller (14) and
modified by Israelsen et al. (7). Cultures carrying the plasmid derivatives of
pAK80 were grown in rich medium overnight at 30°C. The medium used for
L. lactis was M17 medium supplemented with erythromycin (2 μg/ml) and 1%
glucose; for E. coli, LB medium supplemented with erythromycin (200 μg/ml)
was used. The results presented are averages of measurements of the activities of
at least three individual cultures of each clone. The standard errors were less
than 30% for E. coli activities and less than 20% for L. lactis activities. Aliquots
of 25 to 100 μl of the cultures were used in the β-galactosidase assay except in
the case of the weakest promoter clones, where up to 2 ml of culture was
concentrated and used in the assay.
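
For readers unfamiliar with Miller units, the calculation can be sketched as follows. This example assumes the standard formula from Miller's protocol (absorbance at 420 nm corrected for light scattering at 550 nm, normalized to reaction time, culture volume, and cell density at 600 nm); the modifications introduced by Israelsen et al. (7) are not reproduced here, and all readings used in the example are hypothetical.

```python
import math
import statistics


def miller_units(a420, a550, a600, minutes, volume_ml):
    """Standard Miller formula; assumed here, not the modified assay of (7)."""
    return 1000.0 * (a420 - 1.75 * a550) / (minutes * volume_ml * a600)


def mean_and_relative_se(values):
    """Mean of replicate cultures and the standard error as a fraction of it."""
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, standard_error / mean


if __name__ == "__main__":
    # Hypothetical readings from three cultures of one clone.
    replicates = [
        miller_units(0.45, 0.0, 0.5, minutes=10, volume_ml=0.1),
        miller_units(0.50, 0.0, 0.5, minutes=10, volume_ml=0.1),
        miller_units(0.55, 0.0, 0.5, minutes=10, volume_ml=0.1),
    ]
    mean, relative_se = mean_and_relative_se(replicates)
    print(f"{mean:.0f} Miller units, relative SE {relative_se:.1%}")
```

Normalizing to volume and cell density is what makes it possible to compare the weakest clones, assayed from up to 2 ml of concentrated culture, with the strongest ones, assayed from 25 μl.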

RESULTS 

The purpose of this work was to generate a library of syn-
thetic constitutive promoters as a tool for genetic engineering
of L. lactis. The promoters should cover a wide range of pro-
moter activities, in small steps of activity changes, so that they
would be applicable to quantitative physiological studies and
for metabolic optimization. The following strategy was used:
(i) design and synthesize a degenerate oligonucleotide se-
quence that encodes consensus sequences for L. lactis promot-
ers, separated by spacers of random sequences; (ii) convert this
mixture of oligonucleotides to double-stranded DNA frag-
ments, using DNA polymerase and a short oligonucleotide
primer complementary to the 3' end of the degenerate oli-
gonucleotide; and (iii) clone this mixture of DNA fragments
into a promoter probing vector. The idea behind this strategy
is that even though the consensus sequences should be impor-
tant elements of an efficient promoter, the context in which the
consensus sequences are located may modulate the strength of
the promoters to some extent.

Design and construction of synthetic promoters for L. lactis.
A considerable number of promoters have been cloned and 
sequenced from L. lactis (see the review by de Vos and Simons 
[3]). From these data, we extracted extended consensus se- 
quence motifs for L. lactis promoters (Fig. 2A). The Pribnow 
box or the -10 sequence TATAAT and the -35 sequence 
TTGACA, known to be present in many prokaryotic promot- 
ers, are also well conserved for L. lactis. In addition, the se- 
quence TG is often found 1 bp upstream of the -10 sequence; 
it is also possible to determine a consensus sequence for the 4 
bp immediately upstream of the -35 motif, ATTC. Nilsson 
and Johansen (16) found well-conserved sequences among
promoters of the rRNA operons: AGTTT at position -44 and 
GTACTGTT at positions +1 to +8. In addition to these mo- 
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A. 

5' CATNNNNN AGTTT ATTC TTGACA NNNNNNNNNNNNNN TGR TATAAT ANNWNA GTACTGTT 3'
            -44        -35                   -15 -10           +1

B.

5' CGGGATCCTTAAGAATATTATGCATNNNNNAGTTTATTCTTGACANNNNNNNNNNNNNNT
   GRTATAATANNWNAGTACTGTTAACTGCAGCTGAATTCGG 3'

Restriction sites marked on the oligonucleotide, 5' to 3': BamHI, SspI, and NsiI
near the 5' end; ScaI, HincII, PstI, PvuII, and EcoRI near the 3' end.

FIG. 2. Oligonucleotide sequence used for the generation of a library of 
synthetic promoters for L. lactis. (A) Consensus sequence for L. lactis promoters 
derived from data published in the literature. N = 25% each A, C, G, and T; R = 
50% each A and G; W = 50% each A and T. (B) The design of the oligonu- 
cleotide. The sequence contains a number of recognition sequences for restric- 
tion endonucleases, for use in the subsequent cloning strategy. Note that the 
sequence from positions +1 to +8, which is a putative stringent response site, 
can be deleted in the cloning process if necessary. See text for further details. 



tifs, two semiconserved base pairs were included, R (=A or G) 
upstream of the -10 sequence and W (=A or T) at position 
-3. Based on these data, we designed an oligonucleotide 
which also encodes recognition sites for multiple restriction 
enzymes (Fig. 2B). This mixture of oligonucleotides was con- 
verted to double-stranded DNA fragments, using a short 
primer complementary to the 3' end. Finally, the resulting 
double-stranded DNA fragments, encoding potential pro- 
moter structures, were cloned into the polylinker on the pro- 
moter probe vector, pAK80 (7), upstream of the promoterless 
β-galactosidase gene, using E. coli as a host; this resulted in
plasmids pCP1 through pCP46.
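
The size of the sequence space encoded by the degenerate oligonucleotide can be estimated directly from its design. The sketch below is illustrative only: it expands the IUPAC codes of the 60-nucleotide promoter region in Fig. 2B, counts the number of distinct sequences it can encode, and draws one random member, assuming equal incorporation of the bases at each degenerate position as stated in the legend to Fig. 2. The function names are hypothetical.

```python
import random

# Promoter region of the oligonucleotide in Fig. 2B (positions -52 to +8).
TEMPLATE = (
    "CAT" + "N" * 5 + "AGTTTATTCTTGACA" + "N" * 14
    + "TGRTATAATANNWNAGTACTGTT"
)

# Bases allowed by each IUPAC code used in the design.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "W": "AT", "N": "ACGT",
}


def library_complexity(template):
    """Number of distinct sequences encoded by a degenerate template."""
    total = 1
    for code in template:
        total *= len(IUPAC[code])
    return total


def sample_promoter(template, rng):
    """Draw one sequence, choosing uniformly at each degenerate position."""
    return "".join(rng.choice(IUPAC[code]) for code in template)


if __name__ == "__main__":
    print(library_complexity(TEMPLATE))
    print(sample_promoter(TEMPLATE, random.Random(0)))
```

With 22 fully randomized positions and two twofold-degenerate positions, the design encodes 4^23 (roughly 7 × 10^13) distinct sequences, so the 46 clones analyzed here sample only a minute fraction of the possible spacer contexts.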



Activities of the synthetic promoters in L. lactis. Plasmids
pCP1 through pCP46 were then transformed into L. lactis
subsp. cremoris MG1363. The different plasmids gave rise to
colonies exhibiting very different intensities of blue on plates
containing X-Gal. The specific activities of β-galactosidase in
liquid cultures of these clones were then determined (Fig. 3)
and found to vary from 0.3 Miller unit, which is slightly above
the activity found with the cloning vector pAK80 without any
insert, to more than 2,000 Miller units. Together, the
promoters covered 3 to 4 logs of promoter activities in small
steps of activity change.

Sequence analysis of the CP promoters. A very interesting 
point is the molecular basis for the differences in strength of 
the CP promoters, and we therefore took on the task of se- 
quencing the promoter clones. Eighteen clones were perfect in 
the sense that they had the DNA sequence that was specified 
by the oligonucleotide (Fig. 4). The activities of these 18 pro- 
moter clones covered, in small steps of activity change, a 50- 
fold range of activity, from 34 up to 1,800 Miller units. Four of 
the CP promoters had a 16-bp spacer between the -35 and 
-10 sequences instead of the 17 bp specified in the oligonu- 
cleotide sequence, and the activities carried by these four 
clones were weak, ranging from 0.7 to 12 Miller units. Four
clones had base pair changes in the -35 sequence, and two had 
base pair changes in the -10 sequence; those clones also had 
rather weak activity (0.3 to 69 Miller units). 
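
The classification of clones described above can be expressed as a simple check of the -35 box, the -10 box, and the spacer length. The following is a minimal sketch rather than the procedure actually used: it looks for exact matches to TTGACA and TATAAT and therefore cannot distinguish a mutated box from a chance occurrence of the same hexamer inside a randomized spacer. The example sequences are hypothetical and are not clones from Fig. 4.

```python
def classify_clone(seq, box35="TTGACA", box10="TATAAT", spacer=17):
    """Return a list of deviations from the designed -35/-10 architecture."""
    problems = []
    i35 = seq.find(box35)
    if i35 == -1:
        problems.append("-35 changed")
        i10 = seq.find(box10)
    else:
        # Search for the -10 box only downstream of the -35 box.
        i10 = seq.find(box10, i35 + len(box35))
    if i10 == -1:
        problems.append("-10 changed")
    if i35 != -1 and i10 != -1:
        observed = i10 - (i35 + len(box35))
        if observed != spacer:
            problems.append(f"spacer {observed} bp")
    return problems


if __name__ == "__main__":
    # Hypothetical sequences built on the design in Fig. 2B.
    upstream = "CATACCGGAGTTTATTC"
    downstream = "ATCTCAGTACTGTT"
    spacer17 = "GTTCCACCTCGGGTTGG"
    examples = {
        "as designed": upstream + "TTGACA" + spacer17 + "TATAAT" + downstream,
        "16-bp spacer": upstream + "TTGACA" + spacer17[1:] + "TATAAT" + downstream,
        "-35 mutant": upstream + "TTCACA" + spacer17 + "TATAAT" + downstream,
    }
    for name, seq in examples.items():
        print(name, classify_clone(seq))
```

An empty list corresponds to a clone that is "perfect" in the sense used in the text.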

Some clones had 1-bp deletions or a base pair change out-
side the -35 to -10 region or had been subject to other
cloning artifacts. However, the activities of these promoter
clones were all within the range covered by the perfect clones,
i.e., activities from 58 to 2,050 Miller units, which indicates that
in this case, consensus sequences outside the -35 to -10
sequence are of little importance with respect to determining
the promoter strength.

FIG. 3. Library of synthetic promoters for L. lactis. Promoter activities (Miller units) were assayed from the expression of a reporter gene (lacLM) encoding
β-galactosidase transcribed from the different synthetic promoter clones on the promoter cloning vector pAK80. The patterns of the data points indicate which
promoter clones contain errors in either the -35 or the -10 consensus sequence or in the length of the spacer between these sequences.
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Promoter Sp«c. P-gal activity 

Ol i goBeq CATNNNl^AOTTTATTCTTOACANimNNNNNNNNNN^ 

C P2 4 CATGGGTAAOTTTATTCTTCAOICTATCTGGGCCCGATGOTATAXTAAGTGACTACTO 
CP18=CP27 CATTTTGCAOTTTATTCTTOAaiTTGTGTGCTTCGGGTO-TATAATA-C^ 

C P3 7 CATCATTAAOTTTATTCTTCAO^TTGGCCGGAATTGTTO^ATAATACOTAOT 

C P 1 7 CATTCTGGAOTTTATTCTTOAC-CGCTC AGTATGCAGTOgTATAATAGTACAOTACT^ 

CP2 CATTTGCTAOTfrTATTCTTGAaiTGAAGCGTGCCTAATCGTATArrACTTGAGTJ£TOm 

C P4 GATGTTTTAOTTTATTCTTGACACCGTATCGTGCGCGTGATATAATCGGGATCCTTAAGA 
CP44=CP45 CATCGGGTAOTTTATTCTTGACA- ATTAAGTAGAGCCTOATATAATAGTTCAOTAC'POTT 

C P 1 CAT ACCGGAGTTTATTCTTOACAGTTCC ACCTCGGGTT<»TATAATATCTCAOTACTOTT 
CP19=CP22 CATCGCTTAGTTT-TTCTTOACAGGAGGGATCCGGGTTOATATAATA-GTTAaTACTOra 

CP34 CATCGCGAAOTTTATTCTTCACACACCGCAGAACTTGTOOTATAATACAACAOTACTOTT 

CP2 0 CATGGGTGAGTTTATTCTTaACAGTGCGGCCGGGGGCTOATATCATAGCAGAaTACTATT 

CPU OlTAAGTGAGTTTATTCTimCCCGGACGCCCCCCTTTGMATAATAAGT-J^^ 

CP26 CATTCTACAGTTTATTCTTOACATTGCACTGTCCCCCTGOTATAATAACTATACATGCAT 

CP3 CATCCTGTAGTTTATTCTTGACACAAGTCGTTAGCTGTOOTATAATAGGAGAGTACTOTT 

CPl 4 CATGACGGAGTTTATTCTTGACACAGGTATGGACTTATGATATAATAAA^^ 

CPl 3 CATGCTTTACTTTATTCTTOACAAAACCACCAGCTTTTOGTATAATACGTGAOJ^ 

CP40 CATAGAACA0TTTATTCTT<»CATTGAATAAGAAGGCTOATATAATAGC-CA0TACTOTT 

CPS CATTCTTTAOTTTATTCTTGACAAACGTATTGAGGACTaATATAATAGGTGAOTACTOTT 

CP2 8 CATGGGGCCGTTTATTCTTGACAACGGCGAGCAGACCTOGTATAATAATATAOTACTOTO 

CPIO CATGGCTTAOTTTATTCTTGACAGGGTAGTATCACTGTGATATAATAGGACAOTACTaTO 

CP32 CATACGGGA0TOTATTCTK»CATATTGCCGGTGTGTTOOTATAATAACTTAGTACTOTT 

CP3 0 CATGACAGAGTTTATTCTTOACAGTATTGGGTTACTTTGOTATAATAGTTGAGTACraTT 

CP9a CATAGTCTAGTTTATTCTTGACJICGCGGTCCATTGGCTGGTATAATAATTTAGTACTOTT 

C P 3 8 CATAGAGAAOTTTATTCTTGACAGCTAACTTGGCCTTTGATATAATACATGAGTACTaW 

C P4 6 CATGATGTAGTTTATTCTTGACACTGAGAGGGCCTCTTaATATAATAGTTGAOTACTOTT 

C P2 3 CATGTAGGAGTTTATTCTTaACAGATTAGTTAGGGGGTGGTATAATATCTCAGTACTGTT 

CP3 9 CATTGCGAAGTTTATTCTTGACAGTACGTTTTTACCATaATATAATAGTATAaTACTOTT 
CP33=CP35 CATGTTGGAGTTTATTCTTGACATAC AATTACTGCAGTOATATAATAGGTGAaTACTCCT 

C P 1 5 CATTACGTAGTTTATTCTTGACAG AATTACGATTCGCTGGTATAATATATCAOTACTOCT 

CP29 CATCGGTAAG-TTATTCTTGACATCTCAGGGGGGACGTOOTATAATAACTGAOTACTOCT 

C PI 2 b CATATACAAGTTTATTCTTGACACTAGTCGGCCAAAATGATATAATACCTGAOTACT 

CP4 1 CATCCGCAAOTTTATTCTTGACAGCTGAATGTAGACGTGGTATAATAGTTAAOTACTaCT 

C PI 6 aLTTGTGTAGTTTATTCTTGACAGCTATGAGTCAATTTGGTATAATA- -ACAOTACTCAG 

C P4 2 CATTCGTAAGrPTATTCrwaACACCTGAGATGAGGCGTGATATAATAAATAAOT^ 

C P7 TATGCGGTAOTTTATTCTTGACATGACGAGACAGGTGTOGTATAATGGGTCTAGATTAGG 

CP6 CATGTGGGAOTTTATTCTTGACACAGATATTTCCGGATGATATAATAACTGAaTACTOTT 

CP2 5 c-TTTGGCAOTTTATTCTTGACATGTAGTGAGGGGGCTOGTATAATCACATAaTACTOTT 

FIG. 4. Sequence of the area from positions -52 to +8 (relative to the putative transcription initiation site) of the synthetic promoter clones pCP1 through pCP46.
The clones are ordered according to strength. Matches to the oligonucleotide consensus sequence (given at the top) are in boldface. Errors in the -35 or -10 consensus
sequence and deletions in the spacer between these sequences are underlined. Two clones, CP9 and CP12, had two promoter fragments inserted in tandem, a (upstream
fragment) and b (downstream fragment). In these cases, only one of the two tandem promoters was perfect; data for these promoters are shown. β-gal, β-galactosidase.
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Regulation of promoter activities. The synthetic CP promot- 
ers were designed to be constitutive. To test this experimen- 
tally, the expression in exponential growth phase and station- 
ary growth phase was measured for a selection of the promoter 
clones. We found that the specific activity of β-galactosidase
was two- to fourfold higher in the stationary-phase cultures
than in the exponential-phase cultures (data not shown). How-
ever, the copy number of the vector used in these studies has
been shown to increase approximately threefold in the station-
ary phase (11). The higher activity is thus largely accounted for by
the increased gene dosage, which indicates that the CP promoters are
indeed quite close to being constitutive under these conditions.

Activities of the synthetic promoters in E. coli. Another 
interesting point is whether the promoters are functional in 
other organisms, and if so, whether the relative strength of the 
promoters would be dependent on the organism. As described 
above, the promoter cloning vector, pAK80, that we used here 
for construction of the synthetic promoters also replicates in 
E. coli; indeed, the promoter clones were first isolated in E. 
coli. We could therefore measure the activities of the synthetic 
promoters also in E. coli (Fig. 5). The promoter strength was
also highly variable for the individual promoters in this organ- 
ism, and we found that the promoters covered activities from 
0.2 to 500 Miller units. In this case also, the activity increased 
in small steps. 

The absolute values of β-galactosidase units measured in
E. coli were lower on average than those measured in L. lactis; this was
probably a consequence of a low efficiency of translation of the
lacL and lacM genes in E. coli, since these genes and their
ribosome binding sites originate from the gram-positive bacte-
rium Leuconostoc mesenteroides. When some of the strongest
promoters were cloned into a promoter cloning vector de-
signed for E. coli, the promoters turned out to be quite strong
(data not shown).

Figure 6 shows a plot of activity of the CP promoters in
L. lactis and E. coli. The strengths of the individual CP pro-
moters in the two organisms correlate somewhat but not very
well: some promoters which were quite strong in L. lactis were
relatively weak in E. coli, and vice versa. Moreover, the pattern
that we observed in L. lactis, i.e., that the relatively strong
promoters were the perfect ones, did not hold true for E. coli:
here the promoters which had either an error in the consensus
sequence or a shorter spacer were relatively strong.

DISCUSSION 

We have constructed a library of synthetic promoters that 
differ in strength over 3 to 4 logs of activity, and this range of 
activity is covered by small steps of activity increase. Moreover, 
some of the promoters that resulted from this random ap- 
proach turned out to be quite strong. 

The fact that the library of promoters covered such a wide 
range of activities was somewhat surprising to us; the under- 
lying idea behind the construction of the CP promoters was 
that the context of the consensus sequences (the spacers) 
would play a role in modulating the strength of a promoter, 
rather than changing the activity over several logs of activity. 
Indeed, much of that variation (below 5 Miller units) was 
probably a consequence of the accidental introduction of mu- 
tations in the consensus sequences and in the length of the 
spacer regions. In contrast, the strong promoters in L. lactis 
(those having activities higher than 100 Miller units) were all 
perfect with respect to the consensus sequence and spacer 
length. But even when we confine our analysis to these pro- 
moter clones, we find 400-fold variation in promoter activity, 
still in small steps of activity increase, which demonstrates that 
the context in which the consensus sequences are embedded 
(i.e., the spacers) clearly is important for promoter strength. 
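
The fold ranges and log spans quoted in the text follow directly from the lowest and highest activities in each group of clones. The short sketch below simply reproduces that arithmetic from the Miller unit values given in the text; the function names are chosen here for illustration.

```python
import math


def fold_range(low, high):
    """Ratio between the highest and lowest activity."""
    return high / low


def log_span(low, high):
    """Number of orders of magnitude (base 10) between low and high."""
    return math.log10(high / low)


if __name__ == "__main__":
    # (lowest, highest) activities in Miller units, as quoted in the text.
    ranges = {
        "whole library": (0.3, 2000),
        "perfect clones": (34, 1800),
        "conserved -35, -10, and spacer length": (5, 2050),
    }
    for name, (low, high) in ranges.items():
        print(f"{name}: {fold_range(low, high):.0f}-fold, "
              f"{log_span(low, high):.1f} logs")
```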

The ranking of the promoters depended on the organism in
which they were measured, possibly because the σ factor-RNA
polymerase complexes that recognize these promoters have
different structures in the two organisms due to differences in
amino acid sequences. The fact that E. coli accepted some of
the less perfect CP promoters as relatively strong promoters
could indicate that E. coli is more promiscuous with respect to
promoter structure than L. lactis. This makes some sense con-
sidering the composition of the L. lactis genome: the AT con-
tent is 65%, which is much closer to the base composition of
the -35 and -10 consensus sequences. These sequences are
therefore more likely to accidentally occur in L. lactis, and a
stricter requirement for promoter sequences might therefore
be expected for this organism.

FIG. 5. β-Galactosidase activities of the CP promoters in E. coli. The promoter activities were assayed from the expression of a reporter gene (lacLM) encoding
β-galactosidase transcribed from the different synthetic promoter clones on the promoter cloning vector pAK80. The patterns of the data points indicate which
promoter clones contained errors in either the -35 or the -10 consensus sequence or in the length of the spacer between these sequences. See text for further details.

The process of transcription initiation consists of several
events (reviewed in reference 17). First, recognition and bind-
ing of the σ factor-RNA polymerase complex to the promoter
region takes place (closed complex formation). Subsequently,
there is local melting of the DNA double helix (open complex
formation), possibly assisted by local negative DNA supercoil-
ing. Finally, the σ factor-RNA polymerase complex must
dissociate from the promoter and clear the
promoter area, so that another initiation complex may form.
From this model, it is clear that efficient binding between the
σ factor-RNA polymerase complex and the promoter area
does not guarantee a strong promoter; promoter strength must
be a compromise between binding, melting, and clearance, and
probably other factors as well.

What then controls the strength of the individual synthetic 
promoters presented here? It does not appear that any addi- 
tional conserved sequence motifs have been generated among 
the strongest promoters. Rather, it seems that the overall 
three-dimensional structure which arises from a particular nu- 
cleotide sequence could be important. 

The method presented here for tuning gene expression in
the living cell has both advantages and disadvantages com-
pared to the methods that would use an inducible expression
system such as the lac promoter. A disadvantage is that instead
of only one genetic construct, perhaps three to four constructs
have to be made. On the other hand, the constructs are made
in parallel, so that the amount of work should not be propor-
tional to the number of constructs. The inducible systems have
the advantage that gene expression can be turned on at the
proper time during a fermentation, which is sometimes essen-
tial (for instance, when the product is toxic to the host cell).
The work presented here was aimed at generating a library of
constitutive promoters, for achieving a constant level of gene
expression throughout the growth of a culture. We are cur-
rently working on synthetic inducible promoters in which a
regulatory motif has been added. This should allow us to gen-
erate libraries of promoters, which differ in basal expression
level and can be induced to various extents, by changing a
fermentation parameter (i.e., temperature, pH, or salt concen-
tration) or by adding a specific inducer.

FIG. 6. Correlation between promoter activities in L. lactis and E. coli. The
promoter activities measured in E. coli (from Fig. 5) were plotted as a function
of the promoter activities measured in L. lactis (from Fig. 3). The symbols
indicate errors in either the -35 or -10 sequence (solid circles), a 16-bp spacer
(triangles), or promoters with both of these errors (diamonds). The open square
represents the vector clone.

Vol. 64, 1998 SYNTHETIC PROMOTERS 87

The system presented here also has advantages. One is that 
it is easier to attain a steady expression level of the enzyme in 
question, which is often quite difficult with inducible systems 
such as the lac system (8). With the method presented here, 
once the optimal expression level of the enzyme has been 
determined, the engineered strain is ready to use directly in the 
fermentation process. 

An important feature of the system described here, in a 
longer perspective, is the possibility to simultaneously modu- 
late, to different extents, the expression of several individual 
genes or operons located at various positions of the genome in 
the same strain. Metabolic control analysis (5, 10) showed that 
in theory, flux and concentration control can be shared among 
several enzymes in a pathway, and experimental determina-
tions of flux control have often shown that control seems to be
distributed over many enzymes in the living cell (9, 15, 18, 19, 
22, 23): in most cases, there may not be such a thing as a 
rate-limiting step, and even if one finds a step that has a 
measurable control, the control will often disappear relatively 
quickly as the enzyme is being overexpressed. Since the sum of 
flux control must equal unity, this then means that flux control 
has been shifted to other steps in the pathway. In summary, in 
order to increase a given flux in a living cell, it may thus be 
necessary to (i) optimize the individual expression of several 
genes and (ii) after one round of optimization in which one 
enzyme was clamped at the optimal level, continue the opti- 
mization of other enzymes in the pathway. With the systems 
available until now, one would then quickly run out of expres- 
sion systems to use, but with our method, one can in principle 
continue the optimization numerous times. 
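
The shift of flux control described above can be illustrated numerically with a toy model. The sketch below is an assumption-laden example, not an analysis of any pathway studied here: it considers a hypothetical two-step pathway S0 -> X -> product with rate laws v1 = e1(S0 - X) and v2 = e2 X, for which the steady-state flux is J = e1 e2 S0 / (e1 + e2), and estimates the flux control coefficients, defined as d ln J / d ln e_i, by finite differences.

```python
import math


def steady_state_flux(e1, e2, s0=1.0):
    """Steady-state flux of the toy pathway with v1 = e1*(s0 - x), v2 = e2*x."""
    return e1 * e2 * s0 / (e1 + e2)


def flux_control_coefficient(flux, enzymes, index, rel_step=1e-6):
    """Estimate d ln J / d ln e_i by a central difference in log space."""
    up = list(enzymes)
    down = list(enzymes)
    up[index] *= 1.0 + rel_step
    down[index] *= 1.0 - rel_step
    d_ln_j = math.log(flux(*up)) - math.log(flux(*down))
    d_ln_e = math.log(1.0 + rel_step) - math.log(1.0 - rel_step)
    return d_ln_j / d_ln_e


if __name__ == "__main__":
    # Increase the expression of enzyme 1 while enzyme 2 stays constant.
    for e1 in (0.1, 1.0, 10.0):
        c1 = flux_control_coefficient(steady_state_flux, (e1, 1.0), 0)
        c2 = flux_control_coefficient(steady_state_flux, (e1, 1.0), 1)
        print(f"e1={e1}: C1={c1:.3f}, C2={c2:.3f}, sum={c1 + c2:.3f}")
```

In this toy model the two coefficients always sum to unity, and overexpressing enzyme 1 lowers its own control coefficient while raising that of enzyme 2, which is the behavior that makes repeated rounds of optimization necessary.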

In this report, the method for generating synthetic promot- 
ers of different strengths was illustrated for use in the gram-
positive bacterium L. lactis. However, there is no obvious rea-
son why the approach should be limited to this organism, and
the fact that the same promoter library was also functional in 
the gram-negative bacterium E. coli suggests that the approach 
may be universally applicable to prokaryotic organisms. An 
exciting question is then, can the approach be extended to 
work for modulating gene expression in eukaryotic cells? Such 
experiments are under way, and the results are quite encour- 
aging. 

ACKNOWLEDGMENTS 

We are deeply indebted to Regina Schurmann for excellent techni- 
cal assistance. 



This work was funded by the Danish Centre for Advanced Food 
Studies. 



REFERENCES 

1. Boe, L. Personal communication. 

la.Casabadan, M. J., and S. N. Cohen. 1980. Analysis of gene control signals by 
DNA fusion and cloning in Escherichia coli. J. Mol. Biol. 138:179-207. 

2. de Ruyter, P. G., O. P. Kuipers, and W. M. de Vos. 1996. Controlled gene 
expression systems for Lactococcus lactis with the food-grade inducer nisin. 
Appl. Environ. Microbiol. 62:3662-3667. 

3. de Vos, W. M., and G. Simons. 1994. Gene cloning and expression systems in 
lactococci, p. 52-105. In M. J. Gasson and W. M. de Vos (ed.), Genetics and 
biotechnology of lactic acid bacteria. Black ie Academic & Professional, 
Glasgow, United Kingdom. 

4. Gasson, M. J. 1983. Plasmid complements oi Streptococcus lactis NCDO 712 
and other lactic streptococci after protoplast-induced curing. J. Bacteriol. 
154:1-9. 

5. Heinrich, R., and T. A. Rapoport. 1974. A linear steady-state treatment of 
enzymatic chains: general properties, control and effector-strength. Eur. 
J. Biochcm. 42:89-95. 

6. Holo, H., and I. F. Nes. 1989. High-frequency transformation, by electropo- 
ration, of Lactococcus lactis subsp. cremoris grown with glycine in osmotically 
stabilized media. Appl. Environ. Microbiol. 55:3119-3123. 

7. Israelsen, H., S. M. Madsen, A. Vrang, E. B. Hansen, and E. Johansen. 1995.
Cloning and partial characterization of regulated promoters from
Lactococcus lactis Tn917-lacZ integrants with the new promoter probe vector,
pAK80. Appl. Environ. Microbiol. 61:2540-2547.

7a. Jensen, P. R. 1997. International patent application PCT/DK97/00342.

8. Jensen, P. R., H. V. Westerhoff, and O. Michelsen. 1993. The use of lac-type
promoters in control analysis. Eur. J. Biochem. 211:181-191. 

9. Jensen, P. R., H. V. Westerhoff, and O. Michelsen. 1993. Excess capacity of 
H+-ATPase and inverse respiratory control in Escherichia coli. EMBO J.
12:1277-1282. 

10. Kacser, H., and J. A. Burns. 1973. The control of flux. Symp. Soc. Exp. Biol.
27:65-104. 

11. Madsen, P. L. 1996. Transcription of the lactococcal temperate phage 
TP901-1. Ph.D. thesis. Department of Biological Chemistry, University of 
Copenhagen, Copenhagen, Denmark. 

12. Makrides, S. C. 1996. Strategies for achieving high-level expression of genes
in Escherichia coli. Microbiol. Rev. 60:512-538.

13. Maniatis, T., E. F. Fritsch, and J. Sambrook. 1982. Molecular cloning: a
laboratory manual. Cold Spring Harbor Laboratory, Cold Spring Harbor,
N.Y.

14. Miller, J. H. 1972. Experiments in molecular genetics. Cold Spring Harbor 
Laboratory Press, Cold Spring Harbor, N.Y. 

15. Niederberger, P., R. Prasad, G. Miozzari, and H. Kacser. 1992. A strategy
for increasing an in vivo flux by genetic manipulations. Biochem. J. 287:473- 
479. 

16. Nilsson, D., and E. Johansen. 1994. A conserved sequence in tRNA and
rRNA promoters of Lactococcus lactis. Biochim. Biophys. Acta 1219:141- 
144. 

17. Pérez-Martín, J., F. Rojo, and V. de Lorenzo. 1994. Promoters responsive to
DNA bending: a common theme in prokaryotic gene expression. Microbiol. 
Rev. 58:268-290. 

18. Ruijter, G. J. G., P. W. Postma, and K. van Dam. 1991. Control of glucose
metabolism by enzyme IIGlc of the phosphoenolpyruvate-dependent
phosphotransferase system in Escherichia coli. J. Bacteriol. 173:6184-6191.

19. Schaaff, I., J. Heinisch, and F. K. Zimmermann. 1989. Overproduction of
glycolytic enzymes in yeast. Yeast 5:285-290. 

20. Schickor, P., W. Metzger, W. Werel, H. Lederer, and H. Heumann. 1990. 
Topography of intermediates in transcription initiation of E. coli. EMBO J.
9:2215-2220. 

21. Schneider, K., and C. F. Beck. 1986. Promoter-probe vectors for the analysis 
of divergently arranged promoters. Gene 42:37-48. 

22. Snoep, J. L., L. P. Yomano, H. V. Westerhoff, and L. O. Ingram. 1995. 
Protein burden in Zymomonas mobilis: negative flux and growth control due 
to overproduction of glycolytic enzymes. Microbiology 141:2329-2337. 

23. Walsh, K., and D. E. Koshland, Jr. 1985. Characterization of rate-controlling 
steps in vivo by use of an adjustable expression vector. Proc. Natl. Acad. Sci. 
USA 82:3577-3581. 



Journal of Bacteriology, Oct. 1995, p. 5740-5747 
0021-9193/95/$04.00+0

Copyright © 1995, American Society for Microbiology 



Vol. 177, No. 20 



Nucleotide Sequence, Transcriptional Analysis, and 
Glucose Regulation of the Phenoxazinone Synthase 
Gene (phsA) from Streptomyces antibioticus 

CHUIN-JU HSIEH and GEORGE H. JONES* 
Department of Biology, Emory University, Atlanta, Georgia 30322 

Received 27 June 1995/Accepted 8 August 1995 

The nucleotide sequence of a 2.3-kb SphI fragment containing the structural gene (phsA) for phenoxazinone
synthase (PHS) of Streptomyces antibioticus was determined. The sequence was found to contain an open
reading frame (ORF) with a G+C content of 71.5% oriented in the direction of transcription that was
confirmed by primer extension. The ORF encodes a protein with an Mr of 70,223 consisting of 642 amino acids
and is preceded by a potential ribosome-binding site. The codon usage pattern is in agreement with the general
pattern for streptomycete genes, with a 92.5 mol% G+C content in the third position. The N-terminal sequence
of the mature PHS subunit corresponds exactly to that predicted from the nucleotide sequence. Neither ATG
nor GTG initiator codons were identified for the protein. However, a TTG codon was located near the amino
terminus of the mature protein and is a good candidate for the initiator codon. The transcriptional start point
of phsA was located 36 bp upstream of the start codon by primer extension. The -10 region of the putative
promoter showed some similarity to the consensus sequence for the major class of prokaryotic promoters, but
the -35 region was less similar. Comparison of the primary amino acid sequence of PHS of S. antibioticus with
other amino acid sequences indicated that PHS is a blue copper protein with copper binding domains in the
N-terminal and C-terminal regions of the polypeptide chain. A BsrBI fragment containing the promoter region
of phsA and a portion of the ORF was shown to promote xylE expression when cloned in the streptomycete
promoter probe vector pIJ2843. This phsA promoter-dependent xylE expression could be repressed by glucose
in S. antibioticus when the organism was grown on glucose or galactose plus glucose. Thus, the cloned promoter
region appears to contain the sequences responsible for catabolite repression of PHS production.



Actinomycin is one of the antibiotics produced by the gram-positive
actinomycete Streptomyces antibioticus (52). A putative pathway for
actinomycin biosynthesis was proposed several years ago (50), and
biochemical, physiological, and genetic studies have confirmed the essential
details of that pathway. Five enzymes from S. antibioticus, Streptomyces
chrysomallus, and Streptomyces parvulus have been isolated and characterized
to demonstrate their involvement in the actinomycin biosynthetic pathway
(6, 11, 22, 23, 28-31). One of these enzymes, phenoxazinone synthase (PHS),
catalyzes the oxidative condensation of two molecules of 4-methyl
3-hydroxyanthraniloyl pentapeptide to form actinomycinic acid, which is the
penultimate intermediate in the putative biosynthetic pathway (Fig. 1). The
enzyme was first identified by Katz and Weissbach (28) and subsequently
purified by Choy and Jones (6). To date, phsA, the gene coding for PHS from
S. antibioticus, is the only gene involved in actinomycin biosynthesis that
has been cloned (25).

Although essentially all of the enzymes required for actino- 
mycin production have been identified, little is known about 
the regulation of these enzymes and of overall actinomycin 
production. Of all the enzymes identified, PHS is perhaps the 
best characterized. It was shown some years ago by Marshall 
and coworkers that actinomycin production is repressed in S. 
antibioticus cultures grown on glucose or galactose plus glucose 
as compared with cultures grown on production medium with 
galactose alone as the carbon source (39). 

Catabolite control has been implicated in the expression of 
both PHS and actinomycin synthetase I (ACMSI; the enzyme 
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which activates the precursors of the actinomycin chro- 
mophore in S. antibioticus [23, 30, 31]). PHS production was 
demonstrated to be subject to catabolite control shortly after 
the identification of the enzyme (13, 28). It is possible that 
phsA and acmsI are located in the same genomic region in S. antibioticus,
since it is well known that the genes for antibiotic production are
clustered in the streptomycete genome (for examples, see reference 38).
Therefore, detailed molecular analyses of the mechanisms controlling the
expression of phsA are essential to our understanding of the regulation of
actinomycin biosynthesis and the synthesis of other antibiotics. We
report here the nucleotide sequence and transcriptional anal- 
ysis of phsA and identify the promoter region of the gene. We 
also demonstrate that a cloned fragment containing the puta- 
tive promoter is active in a streptomycete promoter probe 
vector and that the activity of the promoter is repressed when 
S. antibioticus transformants containing the relevant constructs 
are grown on glucose or galactose plus glucose as compared 
with cultures grown on galactose as the sole carbon source. 

MATERIALS AND METHODS 

Organisms and growth conditions. The Streptomyces strains used were S.
antibioticus IMRU 3720 and Streptomyces lividans 66 derivative TK24 (18). S.
antibioticus was grown on liquid NZ-amine and galactose-glutamic acid media as
described previously (13). S. lividans was generally grown on yeast extract-malt
extract plus 34% sucrose (YEME) or on tryptone soy broth. For protoplast
preparation, TK24 was grown on YEME with MgCl2 and glycine at final
concentrations of 5 mM and 0.5%, respectively (49). Protoplasts were allowed to
regenerate on R2YE medium (49) for 12 to 24 h and then overlaid with 2 to 3
ml of soft nutrient agar supplemented with thiostrepton at a final concentration
of 500 µg/ml. Tyrosine at 0.075% (wt/vol) was added to the soft nutrient agar for
overlaying when pIJ702 derivatives were used.

Escherichia coli DH5α [F− φ80dlacZΔM15 Δ(lacZYA-argF)U169 endA1 recA1
hsdR17 (rK− mK+) deoR thi-1 supE44 λ− gyrA96 relA1] and XL1-Blue [recA1
FIG. 1. The PHS reaction. The penultimate step in the actinomycin
biosynthetic pathway in S. antibioticus, the oxidative condensation of two
molecules of 4-methyl 3-hydroxyanthraniloyl pentapeptide to yield
actinomycinic acid, is catalyzed by PHS.



endA1 gyrA96 thi-1 hsdR17 supE44 relA1 lac (F′ proAB lacIqZΔM15 Tn10 [Tetr])]
were generally cultured in L broth or on L agar (35). Competent E. coli cells
were prepared by the CaCl2 method and transformed as described by Sambrook
et al. (46). After transformation with pUC19 and pBluescript SK+ derivatives,
transformants were selected on L agar plates containing 100 µg of ampicillin
per ml, 40 µg of 5-bromo-4-chloro-3-indolyl-β-D-galactopyranoside (X-Gal) per
ml, and 0.2 mM isopropyl-β-D-thiogalactopyranoside (IPTG). For single-stranded
DNA preparation, strains containing pBluescript SK+ derivatives were grown in
2×YT medium in the presence of 100 µg of ampicillin per ml and helper phage
VCSM13 for 2 h followed by the addition of 75 µg of kanamycin per ml and
growth overnight. Growth temperatures for Streptomyces spp. and E. coli were 30 
and respectively. 

DNA manipulations. Plasmid and chromosomal DNAs were prepared as described
previously (4, 17, 20) and analyzed by restriction digestion and agarose
gel electrophoresis. In some experiments, restriction fragments were recovered
from low-melting-point agarose as described by Favre (10). Protoplast
preparation, transformation, and regeneration were as described previously
(17, 20, 25). A list of plasmids used or generated in the present study is
provided in Table 1. pJSE923 is a derivative of pIJ2501 (25) with an XbaI
linker inserted at the PvuI site of phsA. pJSE929 contains the blunt-ended
BsrBI subfragment of the phsA promoter region cloned into the HincII site of
pUC19. pJSE935 contains the HindIII-BamHI subfragment of the phsA promoter
region of pJSE929 cloned into HindIII-BamHI-digested pIJ2843 (7).

Enzyme assays. Streptomyces cultures were grown in 250-ml flasks containing
50 ml of glutamic acid-salts medium, 50 µg of thiostrepton per ml as necessary,
and 5 mM CuSO4 at 28°C with shaking at 200 rpm. Cultures contained either 1%
galactose, 1% glucose, or 0.5% galactose plus 0.5% glucose as carbon sources.
The cultures were harvested 12 h after inoculation. Mycelium was washed in 100
mM potassium phosphate (pH 7.5), suspended in a final volume of 2 ml of
sample buffer (19), and disrupted by sonication.

Catechol dioxygenase assays were performed and activities were determined 
spectrophotometrically as described previously (19, 54). Catechol dioxygenase 
specific activity was calculated as the rate of change in A375 per min per milligram
of protein and converted to milliunits per milligram (45). Protein concentrations 
were determined with the bicinchoninic acid protein assay reagent kit from 
Pierce. The PHS assay was performed as described previously (6) with 3-hy- 
droxyanthranilic acid as the substrate. 
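
The conversion from an absorbance slope to a specific activity described above can be sketched as follows. This is an illustration only, not the calculation used in the study: the extinction coefficient (33,000 M^-1 cm^-1 for the ring-cleavage product at 375 nm), the 1-cm path length, the 1-ml assay volume, and the function name are all assumptions, and one unit is taken here as 1 µmol of product formed per min.

```python
def specific_activity_mU_per_mg(delta_a375_per_min, protein_mg,
                                epsilon_M_cm=33000.0,   # assumed extinction coefficient
                                path_cm=1.0,            # assumed cuvette path length
                                assay_volume_ml=1.0):   # assumed reaction volume
    """Convert a rate of change in A375 into milliunits per mg of protein.

    One unit (U) is taken here as 1 micromole of product formed per minute.
    """
    if protein_mg <= 0:
        raise ValueError("protein_mg must be positive")
    # Beer-Lambert law: concentration change (mol/liter/min) = dA / (epsilon * l)
    molar_rate = delta_a375_per_min / (epsilon_M_cm * path_cm)
    # mol/liter/min -> micromol/min contained in the assay volume
    umol_per_min = molar_rate * 1e6 * (assay_volume_ml / 1000.0)
    # U per mg of protein -> mU per mg of protein
    return umol_per_min / protein_mg * 1000.0


# Hypothetical reading: 0.33 absorbance units per min with 0.1 mg of protein
print(specific_activity_mU_per_mg(0.33, 0.1))
```

With a different extinction coefficient or assay volume, only the keyword arguments need to change.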

Nucleotide sequence analysis. Sequential deletion clones from both ends of
the phsA SphI fragment were obtained by exonuclease III-mung bean nuclease
digestion with the exonuclease III-mung bean deletion kit from Stratagene
Cloning Systems. The phsA SphI fragment was subcloned into pBluescript SK+
(Stratagene) modified to contain an SphI site in the polylinker, and the resulting
recombinant plasmids (pJSE900 and pJSE910) were used to create deletion
clones suitable for sequencing. The nucleotide sequences of both DNA strands
of the cloned phsA fragment were obtained by the dideoxy chain termination
method (47). Single-stranded DNA was obtained with VCSM13 as a helper
phage, and the DNA was prepared as described previously (26). The sequencing
reactions were performed basically as described for the 7-deaza-dGTP Sequenase
kit from United States Biochemicals except that the extension and termination
reactions were done at 50 and 70°C, respectively. The reactions were
post-terminated at 70°C for 2.5 min by adding 2.5 U of Taq version 2.0 DNA
polymerase and 1 µl of termination mixture, both from United States
Biochemicals. Difficult compression areas and pause sites were resolved by
using dITP instead of deaza-GTP. The DNA sequences were analyzed with the
DNAsis program from Hitachi and the GCG program from the University of
Wisconsin.

The GenBank accession number for the S. antibioticus IMRU3720 PHS gene
(phsA) is U04283.

Primer extension. In the primer extension experiments, a 24-base
oligonucleotide primer, 5'-GATCTCGGTCTCCCGCGTCACCTC-3', that is located 528
bp downstream of the 5' SphI site and is complementary to the phsA mRNA was
used to reveal the transcriptional start point. End labeling of the 5' terminus of
the oligonucleotide primer with the polynucleotide kinase reaction and the
primer extension reaction were done as described by Moran (42). RNA
preparation was as described previously (17) with the following modifications.
Mycelium was collected on a Whatman no. 4 filter disc by use of a vacuum line to
accelerate the filtration process. The mycelium was quickly scraped off the filter
into a universal bottle and resuspended in 5 ml of modified Kirby mixture at 4°C
(modified Kirby mixture consists of 1% [wt/vol] sodium triisopropylnaphthalene
sulfonate [Eastman Chemicals], 6% [wt/vol] sodium-4-amino salicylic acid
[sodium salt; BDH], and 6% [vol/vol] Tris-EDTA-buffered phenol mixture, and all
solutions were made up in 50 mM Tris-HCl [pH 8.3]). The contents were
vortexed with 10 g of 4.5- to 5.5-mm-diameter glass balls as vigorously as possible
for at least 2 min. Three milliliters of phenol-chloroform mixture was added, and
the mixture was vortexed as described above. The homogenate was then
transferred to a polypropylene tube (Falcon 2006) and centrifuged (10 min at
12,000 × g and 4°C) to separate the phases. The aqueous layer was transferred to
a fresh tube, and an additional 5 ml of phenol-chloroform mixture was added. The
solutions were vortexed thoroughly for 2 min and centrifuged again as described
previously to separate the phases, and this procedure was repeated until very
little interphase material remained visible. One-tenth volume of 4 M sodium
acetate (pH 6.0), followed by an equal volume of isopropanol, was added to the
aqueous phase. The solutions were mixed and left at -20°C for 1 h. The nucleic
acids were collected by centrifugation at 12,000 × g for 10 min, and the
supernatant was discarded. The pellet was rinsed with absolute ethanol and vacuum
dried. The pellet was resuspended in 180 µl of distilled water (treated with
diethyl pyrocarbonate) and 20 µl of 10× DNase buffer (0.5 M Tris-HCl [pH 7.8],
0.05 M MgCl2) and transferred to an Eppendorf tube. DNase (RNase-free;
Sigma Chemical Co.) was added to a final concentration of 30 µg/ml. The
solutions were incubated at room temperature for 30 min. An equal volume of
phenol-chloroform mixture was then added, and the samples were mixed by
vortexing. The phases were separated by centrifugation in a microcentrifuge, and
the aqueous phase was transferred to a fresh tube. The aqueous phases were then
extracted by adding an equal volume of chloroform. Total RNA was precipitated
with 1/10 volume of 3 M sodium acetate (pH 6) and an equal volume of
isopropanol for 2 h at -20°C, and the precipitate was collected by centrifugation.
The RNA pellet was rinsed with 70% and then 100% ethanol, vacuum dried,
resuspended in 100 µl of distilled water, and stored at -70°C. The quantity of
RNA was assessed by spectrophotometry, and the quality was assessed by
agarose gel electrophoresis.
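
Because the primer is stated to be complementary to the phsA mRNA, its reverse complement gives the coding-strand sequence to which it should anneal. The short sketch below derives that sequence and checks the primer length and G+C content. It uses the primer exactly as printed at the start of this section; the function and variable names are illustrative assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    seq = seq.upper()
    if set(seq) - set("ACGT"):
        raise ValueError("sequence contains characters other than A, C, G, T")
    return seq.translate(COMPLEMENT)[::-1]


def gc_fraction(seq):
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# Primer as printed in the text above (5' to 3')
PRIMER = "GATCTCGGTCTCCCGCGTCACCTC"

print(len(PRIMER))                 # primer length in bases
print(reverse_complement(PRIMER))  # coding-strand (mRNA-like) target, 5' to 3'
print(round(100 * gc_fraction(PRIMER), 1))  # percent G+C
```

The same helper can be applied to any candidate primer before checking its position in a sequence record.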



TABLE 1. Plasmids used or referred to in the present study

Plasmid                Description                                   Source or reference

pUC19
pBluescript SK+        Phagemid cloning vector (Stratagene); the vector was modified to
                       contain an SphI site in the polylinker
pIJ702
pIJ2501                The 2.3-kb phsA SphI structural gene from S. antibioticus cloned
                       into the SphI site of pIJ702
pIJ2843                Streptomyces low-copy-number promoter-probe vector
pJSE900 and pJSE910    The 2.3-kb phsA SphI cloned in the SphI site of pBluescript SK+
                       in two orientations
pJSE923                pIJ2501 with an XbaI linker at the PvuI site of phsA
pJSE929                The blunt-ended, ca. 235-bp BsrBI subfragment of the phsA promoter
                       region, extending from position -106 to +135 relative to the
                       transcriptional start site, cloned into the HincII site of pUC19
pJSE935                The ca. 265-bp HindIII-BamHI subfragment of the phsA promoter
                       region of pJSE929 cloned into HindIII-BamHI-digested pIJ2843
FIG. 2. Restriction map of phsA constructed from the phsA sequence.



Determination of the amino-terminal sequence of the PHS subunit. The
amino-terminal sequences of the cloned and native PHS proteins were determined
at the Emory University Microchemical Facility and found to be identical. The
sequence of the first 15 amino acids of the protein is
Thr-Asp-Met-Ile-Glu-Gln-Ser-Asp-Asp-Arg-Ile-Asp-Pro-Ile-Asp.

Enzymes and reagents. Restriction endonucleases were purchased from
Boehringer-Mannheim Corporation, Gibco BRL, and Promega Corporation. Calf
intestinal alkaline phosphatase, T4 DNA ligase, and avian reverse transcriptase
were obtained from United States Biochemicals. Exonuclease III and mung bean
nuclease were obtained from Stratagene. Sigma Chemical Co. supplied RNase,
which was prepared as described previously (46). The 7-deaza-dGTP Sequenase
version 2.0 and Taq version 2.0 DNA polymerase kits were purchased from
United States Biochemicals. [γ-32P]ATP, [α-32P]dCTP, and α-35S-dATP were
purchased from Dupont New England Nuclear Products and Amersham. RNasin
was obtained from Promega. All of the chemicals were of reagent grade or the
highest purity commercially available.

RESULTS 

Nucleotide sequence analysis. A detailed restriction map of the phsA SphI
fragment constructed on the basis of the nucleotide sequence is shown in
Fig. 2, and the nucleotide sequence of the fragment is shown in Fig. 3.
Analysis of the DNA sequence with the FRAME codon preference program (3)
revealed a 1,932-bp open reading frame with 71.5% G+C content, matching the
codon usage of Streptomyces spp. (Fig. 4). The open reading frame presumably
starts with a TTG codon at nucleotide 348 and encodes a deduced polypeptide
of 642 amino acids with a predicted Mr of 70,223. Furthermore, the predicted
initiator amino acid is only one position upstream of the N-terminal amino
acid obtained by protein sequence analysis of purified PHS (the first 15
amino acids shown in Fig. 3; see Materials and Methods). Additional
information on the putative translational start was obtained by inserting an
XbaI linker downstream of this region (Table 1, pJSE923). The inserted linker
created stop codons in all three reading frames. When the resulting
recombinant plasmid, containing the XbaI linker in the PvuI site of phsA, was
used to transform S. lividans, PHS expression from phsA was completely
abolished (data not shown). These results rule out the possibility that the
cloned fragment activates a normally silent phsA gene in S. antibioticus, as
has been observed for S. lividans (25, 37). Upstream of the putative TTG
start codon is the sequence GGGGG (Fig. 3, boxed), which may act as a
ribosome binding site (48). A short stem-loop structure is located 4 bp
downstream of the phsA stop codon (Fig. 3, inverted arrows), but its ability
to function in transcription termination is problematic because of its
length.
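
The two composition figures quoted for the open reading frame (overall G+C content and G+C content at the third codon position) can be computed from any in-frame sequence as sketched below. This is a minimal illustration, not the FRAME program: the function names are assumptions, and TOY_ORF is a hypothetical six-codon sequence, not part of phsA.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def third_position_gc_percent(orf):
    """Percentage of G and C at the third position of each codon.

    The sequence must start in frame and contain whole codons only.
    """
    orf = orf.upper()
    if len(orf) % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    # Every third base, starting with the third base of the first codon
    return gc_percent(orf[2::3])


# Hypothetical toy ORF (TTG start codon ... TGA stop codon), for illustration only
TOY_ORF = "TTGGCCGACGTCAAGTGA"

print(round(gc_percent(TOY_ORF), 2))
print(round(third_position_gc_percent(TOY_ORF), 2))
```

Applied to the sequence deposited under accession number U04283, the same two functions would be expected to give values close to those reported in the text, provided the ORF boundaries are chosen as described.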

Primer extension analysis and identification of the putative phsA promoter.
A 24-mer oligonucleotide primer, corresponding to sequences 530 bp downstream
of the 5' SphI site and 180 bp downstream of the translational start codon
(Fig. 3), was used in primer extension studies to locate the 5' end of the
phsA transcript (Fig. 5). RNA templates were prepared from S. antibioticus
and S. lividans as indicated in the legend to Fig. 5.



The transcriptional start point (tsp) of the phs message revealed by this
analysis is located at the A residue which is 313 bp downstream of the 5'
SphI site and 36 bp 5' to the translation initiation codon. The transcription
start point of the cloned phsA gene in S. lividans TK24 is the same as that
of the chromosomal gene in S. antibioticus (Fig. 5). In addition, there is no
difference in the tsp shown in the primer extension experiments using total
RNA prepared from glucose- or galactose-grown cultures (data not shown).
However, glucose-grown cultures contained less phs-specific message than
galactose-grown cultures. This observation is consistent with earlier data
suggesting that the decreased level of PHS observed in cultures grown on
glucose as compared with that in galactose-grown cultures is due in part to
an effect at the level of phs transcription (20, 21).

On the basis of primer extension studies, putative -10 and -35 promoter
regions were located relative to the transcription start point (Fig. 3).
There are also other interesting features which are located near the
promoter region, including several sets of direct repeat sequences, two sets
of inverted repeats, and two TNTNAN sequences (Fig. 3). These sequences are
noteworthy because they may be involved in the catabolite control of the
phsA gene (41). The function of these sequences will be examined in detail
in subsequent studies.

Confirmation of the presence of a functional promoter upstream of the
transcription start site was obtained by promoter probe cloning. In these
experiments, a BsrBI fragment from phsA (see Fig. 2 and 3) was inserted
upstream of the xylE gene in the promoter probe vector pIJ2843 (7, 36). The
resulting recombinant plasmid was used to transform S. antibioticus and S.
lividans, and mycelial extracts were prepared after 19 h of growth of
control and transformed cultures in liquid media. The results of catechol
dioxygenase assays of those extracts revealed that the untransformed strains
contained negligible levels of enzyme activity, as was also the case for
strains transformed with pIJ2843. In contrast, S. antibioticus and S.
lividans strains containing pJSE935, with the putative promoter fragment,
showed significant levels of xylE activity (data not shown). Thus, the BsrBI
fragment does possess promoter activity, and the promoter probe results
support the identification of the promoter region of phsA suggested by the
sequencing and primer extension studies. The use of pJSE935 in studies of
glucose repression of phsA is described below.

Sequence comparisons with PHS sequence. The deduced 
amino acid sequence of PHS was compared with entries in 
protein databases provided by GenBank by use of the FASTA 
program. The sequence with the greatest homology to PHS 
was that of bilirubin oxidase from Myrothecium verrucaria (32). 
There is 26% identity and 45% similarity between the se- 
quences of PHS and the bilirubin oxidase protein. A lower 
homology (18% identity, 40% similarity) was found for the 
FIG. 3. Nucleotide sequence of the SphI fragment containing the S.
antibioticus phsA gene. Important restriction enzyme sites are indicated above
the sequence. The boxed region denotes the possible ribosomal binding site
(Shine-Dalgarno sequence [S/D]). The potential stem-loop is indicated by a pair
of inverted arrows located toward the end of the sequence. Potential -10 and -35
regions have lines above the sequence and are discussed in the text. The location
of the transcription start point is indicated by an upward arrow and bold type,
and the primer for primer extension is shown by the long arrow under the
nucleotide sequence around position 530. The direct and inverted repeat
sequences are indicated by arrows and numbered in pairs (1 to 8). The TNTNAN
elements are shown in bold type and are located around 253 and 303 bp from the
5' SphI site. Four potential copper binding domains are also indicated in the
sequence (solid bars). These sequence data appear in the EMBL, GenBank, and
DDBJ nucleotide sequence data libraries under accession number U04283.



sequence of manganese-oxidizing protein from Leptothrix dis- 
cophora (8). Copper binding motifs of all three proteins are 
aligned in Fig. 6. All three proteins are involved in oxidation 
reactions, but only PHS and bilirubin oxidase belong to the 
family of blue copper proteins (2, 12, 32). Sequence compari- 
son of PHS with bilirubin oxidase, manganese-oxidizing pro- 
tein, and several other blue copper proteins revealed the pres- 
ence of four regions in the sequence of the former protein 
corresponding to the potential copper binding domains found 
in the sequences of the blue copper proteins (Fig. 6). The 
finding of these copper binding domains confirms PHS as a 
blue copper protein (2, 12). This result is not at all surprising, 
since PHS has been shown to require copper for activity (2). 
The amino acid sequence of PHS contains consensus domains 
for the copper binding regions of the same types (I, II, and III), 
which were revealed by X-ray crystallography of ascorbate 
oxidase from zucchini (40). However, there are just two copper
binding domains found in the manganese-oxidizing protein. 
We speculate that the copper binding domains are components 
of the catalytic sites of these enzymes. 
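
Percent identity and percent similarity figures of the kind quoted above are computed over the columns of a pairwise alignment. The sketch below shows one simple way to do this. It is not the FASTA scoring scheme: the residue groupings used to define "similar" are a hypothetical simplification (real programs use a substitution matrix), columns containing a gap are ignored by assumption, and the two aligned strings are invented for illustration.

```python
# Hypothetical grouping of chemically similar residues (an assumption;
# FASTA and related programs score similarity with a substitution matrix).
SIMILAR_GROUPS = ("ILVM", "FYW", "KRH", "DE", "NQ", "ST", "AG")


def _similar(a, b):
    """True if two different residues fall in the same assumed group."""
    return any(a in group and b in group for group in SIMILAR_GROUPS)


def identity_and_similarity(aligned_a, aligned_b, gap="-"):
    """Return (percent identity, percent similarity) for two aligned sequences.

    Columns in which either sequence has a gap are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = identical = similar = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue
        columns += 1
        if a == b:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif _similar(a, b):
            similar += 1
    if columns == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / columns, 100.0 * similar / columns


# Invented toy alignment, for illustration only
ident, sim = identity_and_similarity("HLHGA-DE", "HVHGSKDD")
print(round(ident, 2), round(sim, 2))
```

Different choices of denominator (alignment length, shorter sequence length) give different percentages, so such values are only comparable when computed the same way.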

Expression of the cloned phsA promoter is repressed by glucose in S.
antibioticus. The production of PHS in S. antibioticus was demonstrated some
years ago to be subject to catabolite control (13, 28). As has been
mentioned, later studies suggested that the production of PHS was regulated
at the transcriptional level (20, 21). In the present study, the effects of
glucose on the expression of the promoter-active fragment cloned in pJSE935
were examined in S. antibioticus. Transformants containing pJSE935 were
grown on 1% galactose, 1% glucose, or a mixture of 0.5% galactose and 0.5%
glucose. Catechol dioxygenase assays were performed on extracts of mycelium
harvested 12 h after inoculation of the growth media. PHS assays were
performed on these same extracts. The results of these experiments,
presented in Fig. 7, show that glucose represses the expression of the phsA
promoter in pJSE935 in the presence or absence of galactose. Thus, the
effects of glucose on the phsA promoter would seem to fit the classical
definition of catabolite repression, which requires that expression of the
relevant gene be inhibited when the organism in question is grown on the
repressing and (relatively) nonrepressing carbon sources simultaneously. It
is significant that the PHS activity in the mycelial extracts exactly
paralleled the xylE activity; PHS production was inhibited when the organism
was grown on glucose alone or on galactose plus glucose (Fig. 7).

One possible mechanism for catabolite repression of phsA expression would
involve the binding of a repressor protein to operator sequences in the
promoter region of the gene. Such mechanisms have been suggested to explain
glucose repression in other streptomycetes (for examples, see references 9
and 51). To examine this possibility in S. antibioticus, the effects of
carbon source on PHS activity were measured in transformants containing
pJSE923, in which the phsA gene is disrupted by an XbaI linker. As controls,
PHS activity was measured in untransformed S. antibioticus and in
transformants containing pIJ702 and pIJ2501. We reasoned that if the phsA
promoter region possesses a repressor binding site, it might be possible to
titrate the repressor by cloning that site at high copy in S. antibioticus.
However, the result of the experiment was the observation that transformants
containing pJSE923 showed the same pattern of PHS expression when grown on
galactose, glucose, or glucose plus galactose as did the wild-type strain
(data not shown). Thus, although the XbaI linker effectively prevented
expression of the cloned phs gene, the presence of the disrupted gene at
high copy did not abolish glucose repression of the endogenous phsA gene.
FIG. 4. Analysis of the DNA sequence with the FRAME program (3) revealed a 1,932-bp open reading frame matching the codon usage of Streptomyces spp.



DISCUSSION 

In the present study, we have characterized the cloned PHS gene from S.
antibioticus. The Mr of the PHS subunit deduced from the nucleotide sequence
data is 70,223. This value differs from the apparent value of 88,000
estimated by sodium dodecyl sulfate-polyacrylamide gel electrophoresis
(SDS-PAGE) (25). The explanation for the anomalous migration of this protein
on SDS-PAGE is not clear. Previous studies have not revealed the presence of
carbohydrate or other substances covalently associated with PHS, but other
features of the protein presumably cause it to migrate in an unexpected
fashion. Two native forms of PHS, large and small (L and S), were reported
previously to have Mrs of 540,000 and 180,000 and to be composed of six and
two PHS subunits, respectively (6). On the basis of the deduced Mr of the PHS
subunit, the corresponding values for L and S would be about 420,000 and
140,000, respectively.
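
The revised values for the L and S forms follow directly from multiplying the deduced subunit Mr by the reported subunit numbers. The arithmetic can be checked as below; the constant and function names are illustrative assumptions, and the numbers are those given in the text.

```python
SUBUNIT_MR = 70223  # subunit Mr deduced from the nucleotide sequence (from the text)


def oligomer_mr(n_subunits, subunit_mr=SUBUNIT_MR):
    """Predicted Mr of an oligomer built from identical subunits."""
    return n_subunits * subunit_mr


# (form, number of subunits, previously reported Mr), all taken from the text
for form, n_subunits, reported in (("L", 6, 540000), ("S", 2, 180000)):
    predicted = oligomer_mr(n_subunits)
    print(form, predicted, round(reported / predicted, 2))
```

The products, 421,338 and 140,446, round to the values of about 420,000 and 140,000 quoted above.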

The results of promoter probe cloning and nucleotide se- 
quence analysis of the putative phsA promoter support the 
identity of the -10 and -35 regions and the transcriptional 
start point of phsA. Only a single start point was observed in 
the experiments illustrated in Fig. 5 and their replicates. It is 
possible that the tsp identified here is artifactual, but it is 
significant that the use of that start point identifies -10 and 
-35 regions with significant homology to the P2 promoter of 
the agarase gene (reference 5 and unpublished results). The 
-10 region, TCTCAT, of the phsA promoter showed more 
similarity to the -10 consensus sequence, TATAAT, of E. coli
FIG. 5. Mapping of the 5' end of the phsA mRNA. A [γ-32P]ATP end-labeled
24-mer oligonucleotide (5'-GATCTCGGTCTCCCGCGCGTCACC-3') was annealed to phsA
mRNA and extended with reverse transcriptase. The reaction products were
separated on a sequencing gel with a sequencing ladder, generated by the use
of the same primer, to determine the transcription start site of phsA mRNA.
RNA templates were from S. lividans TK24 (lane 1), S. antibioticus (lane 2),
and S. lividans transformed with pIJ2501 (lane 3). The arrow indicates the
primer extension product that corresponds to initiation from phsA. The
sequence on the left is the DNA region around the apparent transcription
start site for phsA, indicated by an asterisk. Although the band
corresponding to the extension product obtained with RNA from S.
antibioticus is faint in the reproduction shown here (lane 2), it was
clearly visible on the original autoradiograms.
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FIG. 6. Alignment of the putative copper-binding motifs of PHS, several blue 
copper proteins, and manganese-oxidizing protein. Sequence identities between 
PHS and the other blue copper proteins and manganese-oxidizing protein are 
boxed. The numbers to the left of the motifs denote the positions in the corre- 
sponding protein sequences. The amino acid residues corresponding to potential 
copper binding sites of three recognized types (40) are shown as follows: type I, 
*i; type II, *2; type III, *3. Dashes represent gaps introduced to maximize the 
similarity. Protein sequences were from the following sources: BO, Myrothecium 
vermcaria bilirubin oxidase (32); CP, human ceruloplasmin (33); A VA^ Aspergil- 
lus nidutans laccase (1); N. LA, Neurospuru crassa laccase (14); C, AO, cucumber 
ascorbate oxidase (44); Z. AO, zucchini ascorbate oxidase (40); PC, polar plas- 
tocyanin (43): AZ, Alcaligenes denitrificans azurin (43); MO, Leptothrix disco- 
phora manganese-oxidizing protein (8). 



(15, 16) than to the -10 consensus sequence, TAGGAT, of
Streptomyces promoters (48). The -35 region of the putative
phsA promoter was not strikingly similar to the -35 consensus
sequence of either E. coli or Streptomyces promoters (Fig. 3)
(see references 15, 16, and 48). Overall, the phsA promoter is
not strongly homologous to any promoters for other antibiotic
genes from Streptomyces spp. (48). However, recent studies do
suggest similarities to the P2 promoter of the agarase gene
(dagA) from Streptomyces coelicolor (reference 5 and unpublished
data). Preliminary data also suggest that the phsA promoter
is recognized by an alternative σ factor (34). The
role of this σ factor in S. antibioticus will be described in a
subsequent publication.
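
The comparison of the phsA -10 hexamer with the two consensus sequences can be checked by counting position-by-position identities. The sketch below uses only the three hexamers quoted in the text; the function name is hypothetical, and a simple identity count is just one possible similarity measure, not necessarily the one the authors applied.

```python
def identities(a, b):
    """Count positions at which two equal-length sequences carry the same base."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b))

PHSA_MINUS10 = "TCTCAT"          # -10 region of the phsA promoter
CONSENSUS_MINUS10 = {
    "E. coli": "TATAAT",         # E. coli -10 consensus
    "Streptomyces": "TAGGAT",    # Streptomyces -10 consensus
}

scores = {name: identities(PHSA_MINUS10, seq)
          for name, seq in CONSENSUS_MINUS10.items()}

for name, score in scores.items():
    print(f"{name}: {score}/6 identical positions")
```

The count gives four of six matches against the E. coli consensus and three of six against the Streptomyces consensus, in line with the statement that the phsA -10 region is closer to the E. coli sequence.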

One noteworthy feature of the phsA sequence is the pres- 
ence of several sets of direct and inverted repeats near the 
promoter region (Fig. 3). This is especially interesting since 
some direct and inverted repeat sequences have been reported 
to be involved in the regulation of gene expression in strepto- 
mycetes. For example, repeated sequences have been impli- 
cated in the catabolite control of Streptomyces genes, including 
the chitinase genes of Streptomyces plicatus (9), the galP1 promoter
of the galactose operon of S. lividans (41), and α-amylase
promoters of Streptomyces limosus (51). None of the phsA
direct or inverted repeat sequences is strikingly similar to the
repeat sequences in the studies described above. However, it is
possible that repeat motifs are a common feature of the regions
involved in catabolite repression of streptomycete genes.
The phsA sequence also contains two TNTNAN elements
located within the -10 region and upstream of the phsA promoter
region (Fig. 3). TNTNAN hexamers were suggested to
play a role in galP1 regulation in S. lividans (41).
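
A search for TNTNAN hexamers, where N is any base, can be sketched with a regular expression. The test sequence below is a made-up toy string built around the TCTCAT -10 hexamer quoted earlier; it is not the phsA sequence, and the function name is hypothetical.

```python
import re

# T-N-T-N-A-N with N = any of A, C, G, T. The lookahead makes the search
# report overlapping occurrences as well.
TNTNAN = re.compile(r"(?=(T[ACGT]T[ACGT]A[ACGT]))")

def find_tntnan(seq):
    """Return (1-based start position, hexamer) for every TNTNAN match."""
    return [(m.start() + 1, m.group(1)) for m in TNTNAN.finditer(seq.upper())]

# HYPOTHETICAL toy sequence containing the phsA -10 hexamer TCTCAT.
toy = "GGCTCTCATGG"
hits = find_tntnan(toy)
print(hits)
```

Note that the -10 hexamer TCTCAT itself fits the TNTNAN pattern, which agrees with the statement that one of the elements lies within the -10 region.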

The predicted amino acid sequence of the PHS subunit 
resembles that of proteins belonging to the blue copper protein 
family. Like most members of this group (32), the sequence of 
the PHS subunit contains four consensus domains (1 to 4) that 
are presumed to bind the copper ligands (Fig. 6). Domains 1 
and 2 are located at the N-terminal portion of the protein, 
whereas domains 3 and 4 are nearer the C terminus. Even 
though manganese-oxidizing protein does not belong to the 
blue copper protein family, similarities were observed in the 
copper binding domains 1 and 2 between PHS and the man- 
ganese-oxidizing protein (Fig. 6). In spite of the diverse distri- 
bution of these proteins and their utilization of very different 
substrates, they all use molecular oxygen in the reactions they 
catalyze. Although the active sites of these enzymes have not 
been characterized, the conservation of the copper binding 
sites strongly suggests their involvement in substrate recogni- 
tion and catalysis. 

In this study, we provided evidence for the regulation of the
phsA promoter by catabolite repression. Thus, growth of S.
antibioticus containing the cloned phsA promoter on glucose or
glucose plus galactose led to a significant inhibition of xylE
expression from pJSE935 as compared with that of cultures
grown on galactose alone (Fig. 7). An identical pattern of
inhibition was observed for PHS activity in the same cultures.
There are several important implications of this result. First,
the data strongly suggest that the sequences required for catabolite
repression are contained within the BsrBI fragment
cloned in pJSE935. This fragment lacks the repeats 1, 2, and 8
of Fig. 3. Thus, those sequences are presumably not required
for catabolite repression. Second, it is clear that whatever the
mechanism of catabolite repression in S. antibioticus, the relevant
machinery can act simultaneously on the endogenous phs
promoter and on the cloned sequence, since PHS activity parallels
xylE activity in the experiments illustrated in Fig. 7. With
regard to that mechanism, we presented evidence above that it
may not involve a simple interaction between an operator and





FIG. 7. Effects of carbon source on the expression of the phsA promoter in S.
antibioticus. Transformants containing pJSE935 were grown on galactose, glucose,
or glucose plus galactose as described in Materials and Methods. The
figure shows the results of catechol dioxygenase and PHS assays of extracts of
mycelium harvested 12 h after inoculation. Results represent the averages of
three replicates. The values obtained for extracts grown on galactose (42.4 ± 2.1
mU/mg of protein for catechol dioxygenase and 75.6 ± 5.3 U/mg of protein for
PHS) were arbitrarily set at 100 for purposes of presentation.

a repressor as it was not possible to release expression of the
endogenous phs gene from catabolite repression by cloning the
disrupted gene at high copy. It is possible, of course, that the
repressor binding site involves sequences that were disrupted
by the insertion of the XbaI linker. It should be possible to
distinguish between these possibilities and to learn more about
the mechanism of catabolite repression of phsA expression by
gel mobility shift assays.

The great majority of viral mRNAs in mouse C127 cells transformed by bovine papillomavirus type 1 (BPV)
have a common 3' end at the early polyadenylation site which is 23 nucleotides (nt) downstream of a canonical
poly(A) consensus signal. Twenty percent of BPV mRNA from productively infected cells bypasses the early
polyadenylation site and uses the late polyadenylation site approximately 3,000 nt downstream. To inactivate
the BPV early polyadenylation site, the early poly(A) consensus signal was mutated from AAUAAA to
UGUAAA. Surprisingly, this mutation did not result in significant read-through expression of downstream
RNA. Rather, RNA mapping and cDNA cloning experiments demonstrate that virtually all of the mutant RNA
is cleaved and polyadenylated at heterogeneous sites approximately 100 nt upstream of the wild-type early
polyadenylation site. In addition, cells transformed by wild-type BPV harbor a small population of mRNAs with
3' ends located in this upstream region. These experiments demonstrate that inactivation of the major poly(A)
signal induces preferential use of otherwise very minor upstream poly(A) sites. Mutational analysis suggests
that polyadenylation at the minor sites is controlled, at least in part, by UAUAUA, an unusual variant of the
poly(A) consensus signal approximately 25 nt upstream of the minor polyadenylation sites. These experiments
indicate that inactivation of the major early polyadenylation signal is not sufficient to induce expression of the
BPV late genes in transformed mouse cells.



Eukaryotic RNA polymerase II transcription units are typically
transcribed past the mature mRNA 3' end. These transcripts
are then cleaved and a poly(A) tract of 200 to 300
nucleotides (nt) is added to generate the 3' end of the mature
mRNA (for reviews, see references 30 and 37). Eighty to
ninety percent of animal cell mRNAs contain the sequence
AAUAAA 10 to 30 nt upstream of the poly(A) tail. Another
10% have the variant AUUAAA; other variants are rare (38).
These consensus sequences have been shown to be required
for efficient and accurate cleavage and polyadenylation both in
vivo and in vitro (17, 27, 29). Generally, when this sequence is
mutated, polyadenylation occurs at a downstream site, often 
with reduced efficiency (17). The region upstream and down- 
stream of the AAUAAA consensus signal, including GU-rich 
downstream sequences, has also been identified as playing a 
role in the cleavage and polyadenylation of some transcripts 
(30, 37). 
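
As a rough illustration of the rule of thumb above, the following sketch scans a sequence for the canonical AAUAAA signal and the AUUAAA variant and checks whether a poly(A) site lies 10 to 30 nt downstream of a signal. The toy sequences and function names are hypothetical; only the hexamers, the 10-to-30-nt window, and the BPV coordinates (signal at nt 4180, site at nt 4203) come from the text.

```python
import re

# Canonical signal and the common variant named in the text.
POLY_A_SIGNALS = ("AAUAAA", "AUUAAA")

def find_poly_a_signals(seq, signals=POLY_A_SIGNALS):
    """Return sorted (1-based position, hexamer) pairs for each signal found."""
    rna = seq.upper().replace("T", "U")  # accept DNA or RNA spelling
    hits = []
    for signal in signals:
        # Lookahead so that overlapping occurrences are all reported.
        for match in re.finditer(f"(?={signal})", rna):
            hits.append((match.start() + 1, signal))
    return sorted(hits)

def in_usual_window(signal_pos, site_pos, lo=10, hi=30):
    """True if the poly(A) site lies lo..hi nt downstream of the signal start."""
    return lo <= site_pos - signal_pos <= hi

# HYPOTHETICAL toy sequences: one with the canonical signal, one carrying
# the AAUAAA -> UGUAAA change described for the BPV mutant.
wild_type_toy = "GGAAUAAAGG"
mutant_toy = "GGUGUAAAGG"

print(find_poly_a_signals(wild_type_toy))
print(find_poly_a_signals(mutant_toy))
print(in_usual_window(4180, 4203))  # BPV early signal and early poly(A) site
```

With these inputs the canonical hexamer is found in the first toy sequence, nothing is found in the mutated one, and the 23-nt spacing of the BPV early site falls inside the usual window.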

Bovine papillomavirus type 1 (BPV) induces fibropapillomas 
in cattle and transforms a number of cultured rodent fibroblast 
cell lines to tumorigenicity. The papillomaviruses are unable to 
propagate in such transformed cells, in part because the early 
polyadenylation site used by essentially all BPV transcripts in 
transformed cells is located between the transcriptional promoters
and L1 and L2, the two genes which encode the virion
proteins (Fig. 1A) (16, 23, 39). Similarly, in BPV-induced skin
fibropapillomas, usage of this early polyadenylation site pre- 
cludes expression of the capsid protein genes in transformed 
dermal fibroblasts and presumably in the basal keratinocytes as 
well (4, 5, 35). In terminally differentiating keratinocytes which 
express the capsid proteins and produce virus, about 20% of 
the viral mRNA reads through the early polyadenylation site 
and is instead polyadenylated approximately 3,000 nt down- 
stream at the late polyadenylation site (5). Thus, regulation of 



polyadenylation at the early site appears to be crucial for viral 
late gene expression. 

To study signals that control polyadenylation in BPV-transformed
mouse C127 cells, we mutated the early poly(A)
consensus signal AAUAAA, located 23 nt upstream of the
early poly(A) site. It was expected that mutant transcripts
would now bypass the early poly(A) site and that late region 
sequences would be included in stable RNA. RNA mapping 
experiments instead demonstrated that mutant transcripts 
were polyadenylated at heterogeneous sites approximately 100 
nt upstream of the early polyadenylation site used in cells 
transformed by wild-type BPV. Evidence is presented which 
suggests that an unusual variant of the poly(A) consensus 
sequence, UAUAUA, plays a role in the regulation of poly- 
adenylation at the upstream polyadenylation sites. 

Construction and preliminary characterization of the
poly(A) consensus mutant. To disrupt polyadenylation at the
BPV major early polyadenylation site at nt 4203, oligonucleotide-directed
mutagenesis was used to mutate the poly(A)
consensus signal at nt 4180 from AAUAAA to UGUAAA
(Fig. 1B), thereby creating a new PmlI cleavage site (23a). The
resulting mutation on a BstXI-to-SalI fragment was reconstructed
into the full-length wild-type BPV genome (clone
pBPV-142-6 [33]) to generate mutant pBPV-EPA1. Nucleotide
sequence analysis of the fragment replaced in generating
pBPV-EPA1 (nt 3849 to 4450) demonstrated that no extraneous
mutations were introduced during mutagenesis.

The ability of three isolates of pBPV-EPA1 to transform
C127 cells was assayed by determining the efficiency of focus
formation after BamHI digestion to release the viral DNA
from the plasmid vector and transfection as described previously
(14). All three mutant isolates transformed cells with
approximately the same efficiency as wild-type BPV DNA
(data not shown). Cell lines were derived from pools of foci
induced by pBPV-EPA1 (EPA1p) and by wild-type BPV DNA
(142-6p). ID13 cells, a C127 cell line transformed by infection
with BPV, were used as an additional wild-type control.

FIG. 1. The BPV genome and design of the EPA1 mutation. (A)
The BPV genome linearized at the 3' end of the late transcription unit
is shown. Open boxes indicate translational open reading frames. The
long horizontal arrow indicates the direction of transcription. The
short horizontal arrows indicate the positions of major promoters. The
late promoter is designated P_L. A_E and A_L denote the early and late
polyadenylation sites, respectively. B indicates the position of the
unique BamHI site. Nucleotide numbers are shown at the bottom. (B)
The EPA1 mutation. The top line shows the wild-type BPV DNA
sequence around the early polyadenylation signal. The sequence of the
mutagenic oligonucleotide LI is shown directly below it, with the base
substitutions shown in boldface. The template was the small BamHI to
EcoRI fragment of BPV-1 DNA cloned in M13mp8 (13). The sequence
at the bottom shows the mutation with the new PmlI site
indicated. The open reading frame L2 initiation codon is designated
the MET codon.



Southern blot analysis of viral DNA from transformed cells 
demonstrated that the mutant viral DNA was maintained in 
transformed cells as a multicopy plasmid without gross rearrangement
and with restoration of the BamHI site used to
excise the viral DNA from the plasmid vector (data not 
shown). 

Mapping the 3' end of the mutant mRNA. The mutation in 
pBPV-EPA1 was designed to eliminate polyadenylation at the
wild-type early polyadenylation site immediately upstream of 
the late open reading frames. Extensive Northern (RNA) blot 
analysis and RNA protection experiments failed to detect 
significant amounts of RNA extending past the polyadenyla- 
tion site into the late region, but these experiments did not 
exclude the presence of low levels of read-through RNA (data 
not shown). There was severalfold more stable viral RNA in 
cells transformed by wild-type BPV than in those transformed 
by the polyadenylation site mutant (Fig. 2). 

RNase protection experiments were performed to map the
3' ends of the mutant transcripts. ID13 and EPA1p RNAs were
assayed for protection of an antisense EPA1 RNA probe
spanning the early polyadenylation site at nt 4203 (Fig. 2, left
panel). The size of the fragment protected by RNA from ID13
cells indicates that, as expected, the wild-type viral RNA
extends past nt 4180, the site of the mutation in the probe (lane
c). In contrast, EPA1p RNA protected several fragments
approximately 100 nt shorter than those protected by wild-type
RNA, suggesting that the mutant RNA is polyadenylated
upstream of the normal position (lanes a and b). The difference
in the pattern of protected bands between the two EPA1p



lanes, a and b, is due to the different cleavage specificities of 
the two RNases used in these reactions. The same result was 
obtained with oligo(dT)-selected EPA1p RNA (data not
shown), indicating that these shorter species are polyadenylated,
a conclusion confirmed by cDNA cloning (see below).
There was no evidence of significant polyadenylation of
EPA1p RNA at the usual position, nor were prominent shorter
novel bands protected in the ID13 sample. RNA from two 
additional cell lines generated with the original isolate of the 
mutant and two additional cell lines generated with indepen- 
dent isolates of the mutant showed the protection pattern 
characteristic of the mutant (data not shown). These results 
suggested that sequences downstream of nt 4100 were absent 
from mutant RNA, an interpretation supported by the results 
of protection experiments with additional antisense probes and 
the results of Northern blot hybridization experiments with 
oligonucleotide probes (data not shown). These results are 
interpreted in the right panel of Fig. 2. 

cDNA cloning and sequencing. The results presented above
suggest that new heterogeneous polyadenylation sites near nt
4100 are utilized in EPA1p RNA. To confirm this interpretation,
the 3' ends of both wild-type and mutant RNAs were
cloned and sequenced. Oligo(dT)-selected (3) 142-6p and
EPA1p RNAs were reverse transcribed with oligo(dT) as
primer and the reagents and protocol of a cDNA synthesis kit
(Amersham). The resulting first-strand cDNAs were amplified
by the polymerase chain reaction (PCR) method with the
primers diagrammed in Fig. 3A (18, 32, 34). To specifically
amplify BPV sequences, the upstream PCR primer PCR5
corresponded to BPV nt 3998 to 4031. To selectively amplify
polyadenylated molecules, the downstream PCR primer PCRT
was 5' d(GGGGATGCT25) 3', which hybridized to any product
containing a poly(A) tract. Annealing was carried out at
25°C, because PCRT has a calculated T_m of 38.9°C in PCR
buffer conditions (32). The products of each amplification
reaction were cloned into pUC18, and colonies containing an
insert were identified by colony hybridization (22) with an
oligonucleotide probe PCR1 complementary to a region (nt
4063 to 4089) between the upstream primer and the proposed
3' end of mutant RNA.

The results of sequence analysis of the cDNA clones are
summarized in Fig. 3B. Sites of polyadenylation were identified
as junctions between BPV DNA sequence and tracts of
poly(A). Six of the 11 clones derived from cells transformed by
wild-type BPV were polyadenylated after nt 4203, the previously
described early polyadenylation site (39), thus validating
this strategy of identifying polyadenylation sites. In contrast,
none of the clones derived from mutant RNA displayed the
wild-type polyadenylation site. Instead, seven of the nine
EPA1p clones contain a stretch of poly(A) immediately after
BPV nt 4107, and the other two clones contain poly(A) after nt
4101 and 4092. These results are consistent with the RNase
protection and Northern blot results which indicate the existence
of heterogeneous 3' ends near nt 4100 in mutant RNA
and demonstrate that these new 3' ends are in fact new sites of
polyadenylation. Interestingly, the anomalous clones (almost
half) derived from wild-type RNA showed polyadenylation at
heterogeneous sites similar to those found with the mutant
RNA. These results indicate that there is a population of
mRNAs with heterogeneous polyadenylation sites around nt
4100 in cells transformed by wild-type BPV. The preferential
amplification of shorter PCR products may explain the relatively
frequent isolation of these shorter cDNAs from cells
transformed by wild-type BPV. We have occasionally observed
faint bands in protection experiments with ID13 RNA which

FIG. 2. RNase protection analysis of viral early region RNA in transformed cells. (Left panel) Ten micrograms of total cellular RNA (9, 21)
from EPA1p (lanes a and b), ID13 (lane c), and C127 (lane d) cells was hybridized to an antisense EPA1 RNA probe (28) complementary to BPV
nt 3912 to 4450, spanning the position of the normal early polyadenylation site but containing the mutation at the polyadenylation signal. Hybrids
were digested with either 2 μg of RNase T1 per ml (lane a), 40 μg of RNase A per ml (lane b), or a mixture of both RNases (lanes c to e), and
protected fragments were detected by autoradiography after electrophoresis through a 4% polyacrylamide-50% urea gel. The sample in lane e was
the probe digested after mock hybridization; the sample in lane f is undigested probe. The nucleotide lengths of size markers in lane g are indicated.
The arrowhead indicates the predicted position of a 270-nt fragment extending from the 3' end of the probe to the site of the mutation, which is
generated by cleavage at the mismatch between the wild-type RNA and the mutation in the probe. The vertical line on the left indicates the small
cluster of bands protected by mutant RNA. (Right panel) Schematic representation of the probe, protected fragments generated by RNase
digestion, and the deduced structure of viral RNA species. Arrows indicate the direction of transcription, with the arrowheads representing the
3' end of each transcript. The X indicates the position of the mutation in the probe.



are consistent with minor sites of polyadenylation at these 
upstream positions (data not shown). 
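
The clone counts reported above can be tallied as a quick check on the stated proportions. This is a minimal sketch: the counts are those given in the text, the five wild-type "anomalous" clones are derived as 11 minus 6 (their individual positions are not stated and are therefore lumped into a single category), and the names used are hypothetical.

```python
from collections import Counter

# Mutant (EPA1p) clones: poly(A) junction position -> number of clones.
mutant_clones = Counter({4107: 7, 4101: 1, 4092: 1})

# Wild-type (142-6p) clones. Only the nt 4203 count is stated directly; the
# remainder (11 - 6) is grouped because individual positions are not given.
wild_type_clones = Counter({"nt 4203": 6, "upstream, near nt 4100": 5})

def fractions(counts):
    """Return each category's share of the total number of clones."""
    total = sum(counts.values())
    return {site: n / total for site, n in counts.items()}

for label, counts in (("mutant", mutant_clones), ("wild type", wild_type_clones)):
    for site, share in fractions(counts).items():
        print(f"{label}: {site}: {share:.0%}")
```

On these numbers, five of 11 wild-type clones (about 45%) fall in the upstream region, matching the "almost half" in the text, and seven of nine mutant clones (about 78%) end at nt 4107.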

Identification of a signal controlling polyadenylation at the 
upstream sites. There is no poly(A) consensus sequence or 
previously described functional variant within 100 bp upstream 
of nt 4100. However, the sequence UAUAUA is present at nt 
4073, approximately 30 nt 5' to the poly(A) sites in EPA1p
RNA (Fig. 3B). It is the closest match to the consensus 
sequence in the region, and it appears to be in the appropriate 
position to specify cleavage at the sites detected in mutant 
RNA. To test the role of this sequence in specifying polyade- 
nylation in the absence of the wild-type signal, it was mutated 
from UAUAUA to GAUAUC by using the mutagenic primer 
5' d(AA<rrrCATAC AGGATATCAA ACAAATCA)3', cor- 
responding to BPV sequence from nt 4063 to 4090, and 
single-stranded EPA1 DNA as a template. The resulting
mutant, pBPV-EPA2 (see Fig. 5) therefore contained both the
original mutation at the poly(A) consensus signal and the new
mutations in the putative variant signal. This mutant transformed
C127 cells with approximately wild-type efficiency, and
RNA from a pooled cell line transformed by EPA2 DNA was 
mapped by using RNase protection and an antisense EPA2 
probe (Fig. 4). RNA from ID13 cells protected the fragment 
sizes predicted if cleavage occurred at the sites of mismatch 



between wild-type RNA and the probe (which contains mutations
at nt 4073, 4078, 4180, and 4181) (lane b). EPA2 RNA
protected two major size classes of fragments (lane a). One was
a set of probe fragments approximately 190 to 200 nt long,
corresponding to polyadenylation near nt 4100 as in EPA1p
RNA. These protected fragments comigrate with the fragments
protected by EPA1 RNA (data not shown) and are the
size predicted if polyadenylation occurred at nt 4107, the
mutant site mapped by cDNA cloning. In addition, EPA2p
RNA protected several longer fragments corresponding to
heterogeneous RNA 3' ends between nt 4200 and 4450. The
EPA2 mutation thus reduced the efficiency with which the
upstream polyadenylation sites are used, but it did not appear
to affect the position of poly(A) addition for those transcripts
that are successfully polyadenylated in this region. These
results indicate that the UAUAUA plays a role in specifying
the new upstream sites of polyadenylation in EPA1p RNA.
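
The expected sizes of the protected fragments follow from the probe coordinates. The sketch below assumes inclusive nucleotide numbering and assumes that a protected fragment runs from the upstream end of the probed region (nt 3912) to the last BPV nucleotide before the poly(A) tract; the function name is hypothetical, and depending on the counting convention the results may differ by a nucleotide from the values used by the authors.

```python
# Upstream end (in BPV coordinates) of the region covered by the antisense
# probe, as given in the figure legends (probe spans nt 3912 to 4450).
PROBE_START = 3912

def protected_length(rna_3_end, probe_start=PROBE_START):
    """Length of probe protected by an RNA whose 3' end lies at rna_3_end.

    ASSUMPTION: inclusive coordinates, with protection running from
    probe_start to rna_3_end.
    """
    if rna_3_end < probe_start:
        raise ValueError("RNA 3' end lies upstream of the probed region")
    return rna_3_end - probe_start + 1

# Poly(A) junctions mapped by cDNA cloning of mutant RNA.
sizes = {site: protected_length(site) for site in (4107, 4101, 4092)}
for site, size in sizes.items():
    print(f"3' end at nt {site}: about {size} nt protected")
```

Under these assumptions a 3' end at nt 4107 gives a fragment of 196 nt, within the "approximately 190 to 200 nt" range reported for the upstream sites.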

Discussion. These experiments were designed to study poly- 
adenylation site usage in BPV-transformed mouse cells. A 
point mutation in the poly(A) consensus signal disrupted 
polyadenylation at that site both in vivo, as demonstrated here, 
and in an in vitro polyadenylation system (24). RNase protection,
Northern blotting, and cDNA cloning and sequencing
established that stable mutant transcripts utilized heterogeneous
polyadenylation sites approximately 100 nt upstream of
the wild-type polyadenylation site. The results of the cDNA

FIG. 3. (A) PCR-based strategy to clone the 3' ends of viral RNA.
The top portion of the panel shows the deduced structure of the 3'
ends of the major viral RNAs. The upstream primer, PCR5, is
complementary to all known wild-type and mutant viral early RNAs.
The downstream primer, PCRT, consists of oligo(dT) and a cloning
site but no BPV-specific sequences. After amplification with cDNA
reverse transcribed from polyadenylated RNA as a template, BPV
cDNAs were cloned into pUC18, identified by hybridization to PCR1,
and sequenced. (B) cDNA clone sequences. The sense strand BPV
sequence from nt 4011 to 4210 is shown. The normal poly(A)
consensus signal is enclosed in the solid line, and the putative upstream
poly(A) signal is enclosed in the dashed line. Each dot indicates the
position of a junction between BPV DNA and the poly(A) tract in a
cDNA clone. Clones derived by amplification of wild-type RNA are
represented by dots below the sequence, and those derived from
mutant RNA are represented by dots above the sequence.

cloning also demonstrated that some wild-type transcripts have 
heterogeneous 3' ends in the region used by the mutant RNAs, 
indicating that this is a minor polyadenylation site in cells 
transformed by wild-type BPV. In a related system, Doniger et 
al. (15) found usage of upstream polyadenylation sites by 
human papillomavirus type 16 transcripts in an immortalized 
human exocervical epithelial cell line harboring a human 
papillomavirus type 16 genome with an extensive deletion 
immediately downstream of a wild-type early poly(A) signal. 
The 3' ends of viral RNA from these cells mapped to both the 
normal site and to a heterogeneous region 400 to 500 nt 
upstream of that site. 

The results described here suggest a hierarchy of polyade- 
nylation site usage in BPV-transformed cells, as is summarized 
in Fig. 5. Wild-type BPV mRNA is polyadenylated at the major
polyadenylation site at nt 4203, with a small fraction of
transcripts being polyadenylated at minor upstream sites
around nt 4100. When the major signal is disrupted (as in
EPA1), the sites around nt 4100 become the predominant sites
of polyadenylation. When the major signal is inactivated and 
the minor signal is partially disrupted (as in EPA2), both the 



upstream sites and new downstream sites between nt 4200 and 
4450 are used. Additional experiments have shown that poly- 
adenylation occurs exclusively at these downstream sites when 
the major signal is inactivated and the upstream polyadenyla- 
tion region is deleted (2). There are several potential poly- 
adenylation signals in this downstream region, including a 
sequence at nt 4304 that deviates by 1 nt from the consensus 
polyadenylation signal. In addition, Burnett et al. (8) observed 
polyadenylation near nt 4450 in RNA from cells transformed 
by a spontaneous BPV-1 deletion mutant lacking the major 
poly(A) site and surrounding sequences. One can speculate
that the function of the multiple potential early polyadenyla- 
tion sites in BPV is to ensure that late genes are not expressed 
under inappropriate conditions, for example in transformed 
dermal fibroblasts or basal epidermal keratinocytes. 
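
The hierarchy of site usage described in this section can be written down as a small lookup table. This is purely an illustrative summary of the four situations stated in the text; the function name and the state labels are hypothetical, and combinations not described in the text are deliberately rejected rather than guessed.

```python
# (major signal intact?, state of upstream UAUAUA region) -> sites used.
SITE_USAGE = {
    (True, "intact"): ["nt 4203 (major)", "around nt 4100 (minor)"],  # wild type
    (False, "intact"): ["around nt 4100"],                            # EPA1
    (False, "partially disrupted"): ["around nt 4100",
                                     "between nt 4200 and 4450"],     # EPA2
    (False, "deleted"): ["between nt 4200 and 4450"],                 # ref. 2
}

def predicted_sites(major_signal_intact, upstream_region):
    """Return the poly(A) sites reported for a given combination of signals."""
    key = (major_signal_intact, upstream_region)
    if key not in SITE_USAGE:
        raise ValueError(f"combination not described in the text: {key}")
    return SITE_USAGE[key]

print(predicted_sites(True, "intact"))
print(predicted_sites(False, "intact"))
print(predicted_sites(False, "partially disrupted"))
print(predicted_sites(False, "deleted"))
```

Laying the cases out this way makes the pattern explicit: each time the preferred signal is removed, usage shifts to the next available sites rather than reading through to the late region.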

Polyadenylation site selection appears to be a complex 
process that takes into account both the relative strengths of 
potential sites and their positions relative to one another (12, 
20). Moreover, the representation of polyadenylation sites in 
stable RNA reflects a number of factors in addition to poly- 
adenylation site selection, including the stability of various 
RNA species. The results of the RNase protection experiments 
reported here indicate that the upstream polyadenylation sites 
are used far more abundantly by the early polyadenylation 
signal mutant than by the wild type. However, it is also clear 
that there is less total viral RNA in cells transformed by the 
mutant. It is possible that processing at the upstream sites 
remains relatively inefficient even with the mutant polyadenyl- 
ation signal, resulting in the synthesis of a rapidly degraded 
pool of unprocessed RNA extending into the late region. In
fact, Furth and Baker (19) have described a sequence element 
in the BPV late region which prevents the accumulation of 
stable viral RNA in transformed cells. 

The closest match to a poly(A) consensus signal in the 
vicinity of the minor upstream polyadenylation sites is 
UAUAUA, approximately 25 nt upstream of the new RNA 3' 
ends. RNA from cells containing mutations of both the original 
poly(A) signal and this putative upstream signal contains 
heterogeneous 3' ends at both the upstream sites and at 
additional positions downstream of the normal site. This result 
suggests that the UAUAUA plays a role in directing polyadenylation
at the upstream polyadenylation sites and that the
mutation did not fully disrupt the function of the UAUAUA
sequence. We are not aware of a precedent for UAUAUA
acting as a poly(A) signal in mammalian cells, although it can
direct mRNA 3' end formation and polyadenylation in Saccharomyces
cerevisiae (31). However, we note that the region
around the upstream cleavage sites contains numerous oligo(dT)
tracts and GT dinucleotides, sequence motifs found
near some bona fide mammalian poly(A) signals.

The wild-type poly(A) consensus signal appears to suppress 
utilization of the variant signal located approximately 100 nt 
upstream. Such suppression may be rather general. Connelly 
and Manley (10) studied the simian virus 40 early polyadenyl- 
ation region, which contains two closely spaced AAUAAA 
signals. In the wild-type situation, only the 3' site is efficiently 
utilized. However, if this preferred site was inactivated by 
mutation, increased usage of the 5' site was observed. In 
addition, Denome and Cole (11) showed that addition of 
tandemly arranged polyadenylation signals decreased usage of 
the upstream site. These findings imply that genomes may 
contain numerous potential sites of polyadenylation whose 
activity is suppressed by the relatively close apposition of 
another polyadenylation signal, which perhaps competes more 
efficiently for a limiting polyadenylation factor. Therefore, 
alternative polyadenylation, which is a well-documented control
point for regulating gene expression (25), may result in

FIG. 4. Evidence that UAUAUA at nt 4073 plays a role in poly(A) site selection. (Left panel) Radiolabelled antisense RNA probe extending from
BPV nt 4450 to 3912 was transcribed in vitro from EPA2, hybridized to 10 μg of cellular RNA isolated from EPA2p (lane a), ID13 (lane b), or C127
(lane c) cells, and digested with a mixture of RNases A and T1. Protected fragments were subjected to polyacrylamide gel electrophoresis and detected
by autoradiography. The arrowhead on the left indicates the position of approximately 190- to 200-nt probe fragments extending from the 3' end of
the probe to the upstream sites of polyadenylation around BPV nt 4100. The vertical line on the left indicates the position of the larger fragments
also protected by mutant RNA. The approximately 265-base fragment in lane b appears to be derived from partially digested hybrids. The lengths (in
nucleotides) of coelectrophoresed size markers are shown. P indicates the position of undigested probe (538 nt). (Right panel) Schematic
representation of the antisense probe, sizes of the protected fragments, and deduced structures of viral RNA species. The X's show the positions of
the mutations at the upstream and downstream polyadenylation signals in the probe and in RNA isolated from cells transformed by the double
mutant.

some cases from inactivation of a preferred polyadenylation 
signal rather than by direct activation of a suboptimal one. 
The mechanism by which BPV prevents expression of viral 
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FIG. 5. Usage of early region polyadenylation sites. The horizontal 
lines represent the region of the BPV genome around the early 
polyadenylation site for wild-type BPV DNA (142-6) and the indicated 
mutants. Transcription proceeds from left to right. The unbroken box 
represents the normal early polyadenylation consensus signal at nt 
4180, and the dashed boxes represent the putative upstream polyade- 
nylation signal at nt 4073. The vertical arrows indicate the positions of 
poly(A) addition, and triple arrows indicate heterogeneous polyade- 
nylation sites, with minor sites represented as dashed arrows. Boxes 
containing an X indicate a mutant polyadenylation signal. 



late genes in transformed cells but allows their expression in 
differentiated keratinocytes is central to an understanding of 
papillomavirus biology. One level of restriction in transformed 
cells is clearly at the level of stable mRNA accumulation, 
because little or no BPV mRNA from the late region is present 
in cultured fibroblasts. Analysis of nascent RNA from ID 13 
cells indicates that at least 90% of BPV transcripts terminate 
between the early and late poly(A) sites and therefore never 
reach the late poly(A) signal (6). The mechanism(s) allowing 
production of late RNAs during natural infection may act 
primarily at the level of polyadenylation site selection, or it 
may act at some other steps in mRNA biogenesis, such as 
alterations in promoter usage, splicing patterns, or transcrip- 
tion termination, which secondarily affect cleavage and poly- 
adenylation (for examples, see references 1, 7, 26, and 36). 
However, the results presented here indicate that specific 
suppression of the major early polyadenylation signal is un- 
likely to be the sole step in releasing the block to BPV late 
gene expression, because inhibition of polyadenylation at 
additional potential early sites must also occur. The mecha- 
nism involved in late gene expression must coordinately sup- 
press cleavage and polyadenylation at multiple potential sites 
near the 3' end of the early region in some of the transcripts, 
while many transcripts are still polyadenylated at the early 
polyadenylation site. Regulation of BPV late gene expression 
is clearly a complex process and bypass of the early major
polyadenylation site is a necessary but not sufficient component
of that process. The study of BPV transcriptional regulation
promises to provide insights into not only papillomavirus
biology but also the mechanisms of regulation of gene expres- 
sion in general. 
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INTRODUCTION 

Biological systems are the masters of chemical synthesis.
The remarkable specificity of their catalysts, the enzymes,
allows hundreds of reactions to proceed simultaneously
inside the tiny reactor that is a living cell. Enzymes'
ability to carry out complex chemical reactions, and to do
so under very mild conditions with virtually no waste
products, has earned them the admiration of chemists and
biochemists. It is easy to envision that a future chemical
industry sensitive to both energy needs and the environment
could be modeled after these highly efficient chemical
factories.

The molecules responsible for this remarkable performance
are the enzymes. Enzymes are proteins, linear chains of
typically hundreds of amino acids that fold up into unique,
well-defined three-dimensional structures. The backbone of
the polymer chain folds into a structure that is unique to
the particular catalyst, as illustrated in Fig. 1(a) for the
enzyme subtilisin. The enzyme's substrate (gray), the
compound on which the reaction is catalysed, fits snugly
into the substrate binding pocket. The enzyme positions
specific catalytic amino acid side chains (red) where they
can assist the chemical reaction to proceed. In Fig. 1(b)
the structure of subtilisin showing its amino acid side
chains illustrates the complexity of these molecular
machines. This complexity allows enzymes to perform the
truly impressive functions that support life and create new
life. The result of considerable fine-tuning over eons of
evolution, this complexity also makes it difficult to
manipulate these structures to obtain new and interesting
properties.

An enzyme is defined by a unique sequence of amino acids,
which in turn is dictated by the organism's DNA code (the
gene) and assembled in the cell (Fig. 2). This amino acid
sequence determines how the chain folds and, ultimately,
how the enzyme functions. By modifying the amino acid
sequence, we can alter the enzyme's function; this field is
known as protein engineering. Despite intense research into
fundamental features governing protein folding and
function, there are enormous gaps in our understanding of
two critical processes: the relationship between sequence
and structure and the relationship between structure and
function. As a result, the rational design of new proteins
by the classical 'reductionist' approach can be a
frustrating exercise indeed. In this article I will
introduce a new and highly effective approach to enzyme
design and engineering that bypasses the need to understand
these processes before embarking on a protein engineering
project. But first I will explain why the enzymes provided
by nature are not sufficient.

Chemical engineers who try to design real industrial
processes using biological catalysts are constantly stymied
by a simple fact: biological systems have evolved over
billions of years to perform very specific biological
functions and to do so within the context of a living
organism. Some of the features required for function in a
complex chemical network are undesirable when the catalyst
is lifted out of context. Conversely, many of the properties
we wish an enzyme would have clash with the needs of the
organism, or at least were never required. The chemical
engineer is hardly impressed by a catalyst whose inability
to tolerate the most common of industrial conditions
necessitates complicated hardware and reactors of the size
of football fields. We need catalysts which are stable to
high temperatures, can function in solvents other than
water, tolerate wider ranges of pH,

Fig. 2. Protein engineering involves the manipulation of protein structures and functions at the level of the
amino acid (or DNA) sequence. Significant gaps in our understanding of the relationships between
sequence, structure and function severely limit our ability to 'rationally design' new functions.

catalyse reactions on substrates not encountered in
nature, and even catalyse new reactions not found in
nature.

Many clues as to how to engineer better enzymes come from
studying how nature has created enzymes. By studying the
evolution of natural proteins, we have learned in fact that
they are highly adaptable, constantly changing molecules,
at least over evolutionary time scales. They can adapt to
new environments and they can even take on new tasks. We
know, for example, that many enzymes catalysing very
different reactions have come about by divergent evolution
from a common ancestral protein of the same general
structure, acquiring diverse capabilities by processes of
random mutation, recombination, and natural selection. For
example, the versatile protein structure known as the α/β
barrel diverged somewhere in the distant past to create a
whole series of enzymes we know today (Reardon and Farber,
1995). The four enzymes shown in Fig. 3(a), for example,
catalyse quite different reactions; their physical
properties and amino acid sequences are also quite
disparate. It is useful to note that, while the barrel-like
protein fold is highly conserved, the amino acid sequences
and functions of these enzymes are not.

A fascinating recent example of enzyme evolution is the
appearance of phosphotriesterase, an α/β barrel enzyme that
hydrolyses, at diffusion-limited rates, pesticides and
chemical warfare agents that have existed only for about 50
years. It has been suggested that this enzyme, discovered in
a soil bacterium, evolved during the last 50 years from a
related sequence identified in the common E. coli bacterium
and now known as the 'phosphotriesterase homology protein'
(Scanlan and Reid, 1995). The biological function of this
latter protein is unknown.

We also know that enzymes of a given function (for
example, all catalysing a particular step in a metabolic
pathway) can exhibit widely different properties
(stability, solubility, tolerance to pH, etc.), depending on
where they are found. For example, the three glyceraldehyde
phosphate dehydrogenase (GAPDH) enzymes listed in Fig. 3(b)
have very similar three-dimensional structures; their
sequences are less similar. We know

Fig. 1. The 275 amino acids of subtilisin E fold into a unique three-dimensional structure. (a) The backbone
fold is represented here by a 'ribbon' diagram, constructed from X-ray crystal structure coordinates (Dauter
et al., 1991) using the programs MolScript and Raster3D. Peptide substrate and two stabilizing calcium
ions are shown in gray. Side chains of catalytic amino acid residues are shown in red. (b) Subtilisin structure
showing the positions of the amino acid side chains (yellow).




Fig. 6. Molecular model of subtilisin E showing the 12 amino acid substitutions that increase enzyme activity
in DMF (You and Arnold, 1996). Yellow amino acids were accumulated during screening for higher
specific enzyme activity (Chen and Arnold, 1993). Red amino acids were found during screening for total
(expressed) enzyme activity (You and Arnold, 1996). Calcium ions and peptide substrate are shown in gray.

Molecular model of the pNB esterase showing positions of the antibiotic p-nitrobenzyl ester substrate,
catalytic residues (red), and six beneficial mutations accumulated during directed evolution
(Moore and Arnold, 1996).
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b) Enzymes evolve for different environments 



Glyceraldehyde-3-phosphate dehydrogenase (GAPDH)

GAPDH from Homarus americanus (Tm = 50°C)
GAPDH from Bacillus stearothermophilus
GAPDH from Thermotoga maritima

Fig. 3. Structure is conserved during evolution, while amino acid sequences and specific functions are often
not. (a) The enzymes indicated appear to have evolved from a common ancestral α/β barrel
protein. (b) Three GAPDH enzymes isolated from different organisms have very similar structures, but
quite different stabilities and amino acid sequences (Buehner et al., 1974; Skarzynski et al., 1987;
Korndörfer et al., 1995).



that they, too, diverged from some common ancestor
a long time ago to occupy their current niches. The
Thermotoga maritima bacterium thrives at very high
temperatures in ocean thermal vents; consequently, its
enzymes can tolerate much higher temperatures than



the analogous enzymes from an organism which
grows under less extreme conditions, such as B.
stearothermophilus. The Thermotoga protein unfolds
at 98°C, while the same enzyme from the American
lobster unfolds at only 50°C. As with the α/β barrel
enzymes, the structural fold of the GAPDHs is highly
conserved, while the detailed amino acid sequences
and specific properties are not.

DIRECTED EVOLUTION: EXPLORING NEW FUTURES

The explosion of tools that has come out of molecular
biology during the last 20 years has made it possible for
us to consider 'evolving' the components of biological
systems (DNA, RNA and proteins) for features never required
in nature. We can both speed up the rate and channel the
direction of evolution by controlling mutagenesis (the rate
and types of changes made) and the accompanying 'selection'
pressures. As a result, processes that would take millions
of years in nature can in principle be accomplished during
the time scale of a Ph.D. thesis. By uncoupling the enzymes
from the constraints of function within a living system, we
can step into and explore a variety of futures, futures
that can include novel environments (evolution in a sea of
methanol instead of water?) or even entirely new functions
(enzymes to break down hazardous chemicals?). We can explore
questions such as 'can one catalytic activity become
another, and how?' Furthermore, by evolving new functions
and thereby new solutions to molecular design problems, we
learn things about these amazing molecular machines that
might never be revealed if we were to study only those that
exist in nature.

The possibilities for biotechnology are especially
exciting. Directed evolution is a very practical approach
to tailor-making enzymes for a wide range of applications.
In addition to building enzymes with new features and
functions, we can explore important questions such as 'how
might an enzyme change its sequence and properties to break
down or evade a drug?' We could conceivably anticipate in
laboratory experiments what might happen to drug
resistances in nature. In directed evolution experiments we
could also tune enzymes to function optimally under
conditions specified by us, rather than the context of the
living organism in which it evolved. New enzymes could be
evolved to carry out reactions never required by living
organisms.

DEVELOPING A WORKING STRATEGY FOR DIRECTED
ENZYME EVOLUTION

In a directed evolution experiment, we first generate a
library of many different possible 'solutions' to a
problem. The next step is to find the correct solution(s),
enzymes that exhibit the desired property. A conceptual
challenge comes in planning how to create this library of
solutions. The number of possible enzymes one can make is
so vast that an exploration of their functions must be
carefully guided in order to avoid becoming hopelessly
lost. A typical enzyme is a linear polymer of 300 amino
acids. With 20 possible amino acids at each position in the
chain, there are 20^300 possible different linear
combinations. If even only a small fraction (say, 1 in 10^)
of all these sequences folds into a well-defined
three-dimensional structure, there are still more
structured proteins than there are atoms in the universe!
(Note that even in three billion years, nature has not had
a chance to explore but a tiny fraction of the
possibilities. This also means that there are very exciting
possibilities for future evolution, including evolution in
the test tube.) Because a random sampling of amino acid
sequences is unlikely to lead to the desired protein, we
must begin our exploration by starting from a point that we
hope is close to where we want to be: an enzyme that
approximates what we want, but is not ideal. Then we evolve
it, by accumulating small changes, similar to what happens
in nature.
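
The size of this sequence space is easy to check numerically. The following is a minimal sketch, not part of the original article: the function name is hypothetical, and the figure of roughly 10^80 atoms in the observable universe is an outside assumption used only for comparison.

```python
import math


def log10_sequence_count(length: int, alphabet: int = 20) -> float:
    """Return log10 of the number of linear sequences of the given length."""
    return length * math.log10(alphabet)


# Figures taken from the text: 300 residues, 20 possible amino acids per position.
log10_total = log10_sequence_count(300)

# ASSUMPTION (not from the article): about 10^80 atoms in the observable universe.
LOG10_ATOMS = 80

print(f"20^300 is roughly 10^{log10_total:.0f}")
print(
    "Even if only 1 sequence in "
    f"10^{log10_total - LOG10_ATOMS:.0f} folded, about 10^{LOG10_ATOMS} would remain"
)
```

Working in logarithms keeps the numbers readable; Python integers could also hold 20**300 exactly if the full value were wanted.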

Nature is very good at searching mutant libraries for
useful solutions. Unfavorable mutations are winnowed out at
the same time as beneficial mutations are amplified, by
linking the organism's growth rate and reproductive success
to the performance of its components. In this process of
selection, those organisms which grow faster quickly
dominate, allowing an efficient search of very large
populations (10^ or more for bacteria).

Unfortunately, many of the features that are of interest to
us cannot be linked to the survival or growth of the host
organism, the prerequisite to selection. Enzymes, for
example, can tolerate a variety of environments that cannot
sustain life, so that the organism dies long before the
enzyme has a chance to 'show its stuff'. For most problems
of practical interest, in fact, mutant enzyme libraries
must be screened rather than selected, one enzyme at a
time. That is, the enzyme variants must be tested
individually (screened) for the property of interest. This
unfortunate reality effectively limits the search for
improvements to mutant libraries containing perhaps 10*-10*
variants, several orders of magnitude smaller than what one
can search when survival depends on success.

The strategy for molecular evolution is then illustrated by
calculating how many different sequences one can create by
starting from a given enzyme and making a few amino acid
substitutions, as shown in Table 1. While there are only
5700 possible single mutants of a 300 amino acid enzyme,
there are still more than 30 billion different sequences
that differ from the original enzyme at only three
positions. While a rapid screen might be able to cover a
large fraction of all single mutants, and even some
significant fraction of all double mutants, screening would
be unable to give more than a very sparse sampling of the
enzymes with multiple amino acid substitutions. Unless a
vast majority of the mutations led to the desired property,
dealing with a library of multiple mutations would be an
experiment based on wishful thinking! (As might be expected
for a finely tuned molecular machine, most mutations are
deleterious or at least neutral; beneficial mutations are
generally rare. The frequency with which one can expect to
find beneficial mutations will depend on the extent to
which the particular feature of interest has already been
optimized: the pathways up the mountain necessarily
decrease in number as the pinnacle is approached.) Because
luck is generally not an acceptable basis for the success
of an experiment, the search is effectively limited to
proteins with sequences and therefore properties very
similar to their parents. In addition, we must be able to
tune the rate or mode of mutation to produce libraries of
primarily single amino acid substitutions.
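
The counts quoted above follow from simple combinatorics: choose which k of the 300 positions change, then pick one of the 19 other amino acids at each. The sketch below illustrates that arithmetic; the function name is hypothetical, and this is one reading of how the numbers arise rather than a calculation given in the article.

```python
from math import comb


def variants_with_k_substitutions(length: int, k: int, alphabet: int = 20) -> int:
    """Count sequences differing from a parent at exactly k positions.

    comb(length, k) picks the positions that change; each changed position
    can take any of the (alphabet - 1) residues other than the parent's.
    """
    return comb(length, k) * (alphabet - 1) ** k


# A 300 amino acid enzyme, as in the text.
for k in range(1, 6):
    print(f"{k} substitution(s): {variants_with_k_substitutions(300, k):,}")
```

For k = 1 this gives 300 x 19 = 5700, and for k = 3 just over 30 billion, matching the figures in the paragraph above.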

The principles and power of directed evolution are best
illustrated with examples. The first example will be the
evolution of an enzyme to function in a polar organic
solvent. It is well known that subtilisin, which normally
cuts up peptides and proteins by cleaving the peptide bonds
linking the amino acids together, will also catalyse
peptide bond formation. Peptide bond formation is favored
in organic media, as water participates in unwanted side
reactions as well as hydrolysis of the product. Subtilisin
actually remains folded and reasonably stable in high
concentrations of polar organic solvents such as
dimethylformamide (DMF). Unfortunately, the catalytic
activity is very low. There is no fundamental reason,
however, why subtilisin could not function in DMF: the
enzyme's unhappiness reflects a balance among a very large
number of noncovalent interactions in the system (protein,
solvent, substrates and products), a balance that is
adversely affected when the protein is dissolved in a
nonaqueous medium. Because these complex interactions are
poorly understood, we could not address this problem by a
rational design approach. We therefore took the
'irrational' approach and asked whether we could 'evolve' a
subtilisin that would function well in DMF (Chen and
Arnold, 1991, 1993; You and Arnold, 1996).

The arguments set out above led us to the strategy for
directing the evolution of an enzyme to perform a new
function (or, in this case of subtilisin, an old function
but under new conditions) illustrated in Fig. 4. In
comparison to the enzyme performing a function for which it
is selected, peptide hydrolysis in aqueous media, the new
job is performed poorly indeed. Subtilisin has not been
selected for hydrolysis in DMF, and there is, not
surprisingly, a great deal of room for improvement. Because
it is feasible to search only those subtilisin mutants with
one or two amino acid substitutions, we will create and
screen a library of such mutants for progeny slightly
better than their parent. The screening method for
identifying useful mutations should ensure that the
expected small enhancements brought about mainly by single
mutations can be measured. Although these progenies will
generally resemble their parents, after many generations
new features can develop, such that the descendants can be
quite different from their ancestor. Therefore, the
generation of new, useful enzymes also relies on having an
effective strategy for accumulating many such small
improvements. One such strategy involves carrying out
sequential generations of random mutagenesis on the gene
(DNA sequence coding for the enzyme) to create a mutant
library, coupled with screening of the resulting proteins.
In each generation a single variant is chosen as the parent
for the next generation, and sequential cycles allow the
evolution of the desired features.
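
The cycle just described (mutate the parent, screen the library, carry the best variant forward) can be sketched as a toy simulation. Everything below is illustrative: the "activity" function is a hypothetical stand-in that counts matches to an arbitrary target string, mutation acts directly on the amino acid sequence rather than on DNA, and the library size and generation count are arbitrary.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(parent: str, rng: random.Random) -> str:
    """Return a copy of parent with one randomly chosen residue substituted."""
    pos = rng.randrange(len(parent))
    choices = [aa for aa in AMINO_ACIDS if aa != parent[pos]]
    return parent[:pos] + rng.choice(choices) + parent[pos + 1:]


def toy_activity(seq: str, target: str) -> int:
    """HYPOTHETICAL screen: number of positions matching an arbitrary target."""
    return sum(a == b for a, b in zip(seq, target))


def evolve(parent, target, generations, library_size, rng):
    """Sequential generations of single-mutant libraries with one parent each."""
    history = [toy_activity(parent, target)]
    for _ in range(generations):
        library = [mutate(parent, rng) for _ in range(library_size)]
        best = max(library, key=lambda s: toy_activity(s, target))
        # Keep the old parent unless the screen finds something better.
        if toy_activity(best, target) > toy_activity(parent, target):
            parent = best
        history.append(toy_activity(parent, target))
    return parent, history


rng = random.Random(0)
target = "".join(rng.choice(AMINO_ACIDS) for _ in range(30))
start = "".join(rng.choice(AMINO_ACIDS) for _ in range(30))
final, history = evolve(start, target, generations=10, library_size=200, rng=rng)
print("activity per generation:", history)
```

Because each library contains only single mutants, the toy activity can rise by at most one step per generation, which mirrors the point that improvements are small and must be accumulated.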

We implemented this strategy to evolve subtilisin to
function in DMF. A powerful molecular biology tool, the
polymerase chain reaction (PCR), was used to make millions
of copies of the gene that codes for the natural, or
wild-type enzyme. By carrying out this (enzymatic) reaction
under sub-optimal conditions, we could introduce base
substitutions randomly throughout the DNA at a controllable
rate. At the end of this reaction we have millions of gene
copies, most slightly different from the wild-type one.
These genes are placed back into a circular double-stranded
piece of DNA (a plasmid) that contains all the instructions
the bacterial cells need to translate the DNA into protein.
When the bacteria are transformed with these plasmids, we
have millions of individual chemical factories, each
producing a different variant of the original enzyme.

Table 1. The 'molecular evolution number problem'

Number of amino acid substitutions    Number of possible sequences
1                                     5700
2                                     16,190,850
3                                     30,557,530,900
4                                     43,109,036,717,100
5                                     48,489,044,499,400,000

Note: Starting with an enzyme of 300 amino acids, the number of possible
variants with the indicated number of amino acid substitutions.


Fig. 4. A working strategy for directed enzyme evolution.
The screening method should ensure that small enhancements
brought about mainly by single mutations can be
measured. The evolution of a new, useful enzyme requires an
effective strategy for accumulating many such small
improvements.

Fig. 5. Results of directed evolution of subtilisin for activity
in DMF by sequential generations of random mutagenesis
and screening. The accumulation of 12 amino acid substitutions
in sequential generations of random mutagenesis and
screening resulted in an enzyme >500-fold more active than
the wild-type enzyme in 60% DMF.



Next, the bacterial colony or colonies which produce a
subtilisin that is more active in DMF must be found. In
this early experiment our screening strategy was crude, but
effective. Because subtilisin is secreted from the bacilli,
the variants could be screened visually on nutrient plates
containing a protein (casein), in the presence and absence
of DMF. The active enzyme creates a visible 'halo'
surrounding the bacterial colony whose size is proportional
to the catalytic activity.* Variants with higher hydrolysis
activity than wild-type on the DMF-containing plates could
be identified from their bigger halos (Chen and Arnold,
1991).

The results of the directed evolution effort are summarized
in Fig. 5. At first we identified three amino acid
substitutions that individually improved the wild-type
enzyme's activity several-fold. Using site-directed
mutagenesis we combined those three with a fourth mutation
reported to improve activity and stability in other
subtilisins, to obtain a four amino acid variant about
40-fold more active than wild-type in 60% DMF (Chen and
Arnold, 1991). Since the process of sequencing the genes of
all the positive variants and then combining the mutations
by site-directed mutagenesis was laborious, we decided to
carry out sequential generations of random mutagenesis and
screening, no longer stopping on the way to sequence the
intermediates. Applying an additional six generations of
mutagenesis and screening a few hundred colonies in each
generation, we created an



*Because halo size also depends on enzyme expression
level, enzyme diffusion and colony size, it is useful for
a rough cut. Positives were confirmed by a second level of
screening in liquid culture (Chen and Arnold, 1993).



enzyme that is more than 500-fold more active in 60%
DMF than the wild-type subtilisin E (You and Arnold,
1996). This enzyme exhibits substantial activity
even in 85% DMF. The whole process was surprisingly
rapid: a total of only about 10,000 colonies were
screened to obtain a huge improvement in catalytic
activity.

The gene for the final evolved enzyme was sequenced to
determine the amino acid substitutions that allowed this
enzyme to recover its activity in DMF. Of 275 amino acids,
12 were altered; their positions are indicated in Fig. 6.
Although the DNA substitutions are targeted randomly
throughout the entire subtilisin gene sequence, the amino
acid substitutions that enhance catalytic activity are all
positioned on the surface of the enzyme, surrounding the
active site and substrate binding pocket. The majority are
in evolutionarily variable loops that connect elements of
conserved secondary structures (helices and sheets) (Chen
and Arnold, 1993; You and Arnold, 1996). This information
could of course be utilized in developing more 'rational'
design strategies, including narrowing the sequences
exposed to random mutagenesis in directed evolution.

Finally, it is worth noting that the resulting enzyme is
indeed a far more efficient catalyst than wild-type
subtilisin for the polymerization of amino acids. This
evolved enzyme can catalyse, for example, the formation of
poly-L-methionine starting from a racemic mixture of
methionine methyl ester. The evolved enzyme allows the
synthesis of significantly longer polymers and at much
higher yields than the native enzyme in 60-70% DMF (Zhao,
H., unpublished results).

The advantage of directed evolution over site-directed
mutagenesis is clear: the same amount of effort could
support the construction and screening of at most a few
dozen variants with mutations directed to specific
locations. Without a clear mechanism, it would be difficult
indeed to pinpoint 12 amino acid substitutions that enhance
catalytic activity in DMF. Even then, single site-directed
mutations would have to be accumulated to create a useful
enzyme, itself a substantial mutagenesis effort involving
trial and error to find optimal combinations.

The most attractive feature of the evolutionary strategy
outlined in Fig. 4 is its simplicity. It is possible,
however, that this simple 'up-hill climb' approach is not
an optimal approach to the evolution of a particular
enzyme. There are obviously a great number of pathways
possible for the evolution of a protein, and each choice of
parent for the next generation represents an irreversible
step along one particular pathway. What would happen if we
simply repeated the experiment? Depending on which pathway
was chosen or which mutation happened to be found first,
the enzyme could end up on a local optimum, unable to
evolve further. This approach may also appear slow;
improvements are small in each step and necessarily become
harder to find the closer the enzyme gets to an optimum.






Directed evolution: creating biocatalysts for the future



SEX IN THE TEST TUBE 

An alternative directed evolution strategy we have recently
explored incorporates some important advantages attributed
to sex in the evolutionary process. Gene recombination, the
cutting and pasting of whole genes or pieces of genes, can
significantly increase the speed of molecular evolution by
rapidly accumulating beneficial mutations and providing a
mechanism to remove deleterious ones. To incorporate
recombination into directed evolution, we randomly
recombine genes with positive mutations. A search for
better combinations of mutations completes a generation of
directed evolution.
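
A highly simplified picture of random recombination is to build each child gene by switching between parent genes at random crossover points. The sketch below assumes equal-length, aligned parents and a fixed number of crossovers; the function and the example sequences are hypothetical, and the laboratory procedure is not modeled.

```python
import random


def recombine(parents: list[str], rng: random.Random, crossovers: int = 2) -> str:
    """Build one child by copying segments from randomly chosen parents.

    Assumes all parents are aligned and of equal length (a simplification).
    """
    length = len(parents[0])
    if any(len(p) != length for p in parents):
        raise ValueError("parents must have equal length")
    # Crossover points split the gene into (crossovers + 1) segments.
    points = sorted(rng.sample(range(1, length), crossovers))
    child = []
    start = 0
    for end in points + [length]:
        donor = rng.choice(parents)  # each segment comes from one parent
        child.append(donor[start:end])
        start = end
    return "".join(child)


rng = random.Random(1)
parent_a = "AAGAAAAAAA"  # hypothetical parent carrying one mutation (G)
parent_b = "AAAAAAATAA"  # hypothetical parent carrying another mutation (T)
library = [recombine([parent_a, parent_b], rng) for _ in range(20)]
print(sorted(set(library)))
```

Children can carry both mutations, one, or neither, so a screen of the recombined library can pick up improved combinations and discard segments carrying unwanted changes.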

We have tested this new 'sexual' approach by creating an
enzyme that efficiently catalyses the hydrolysis of the
p-nitrobenzyl (pNB) ester of a β-lactam antibiotic in the
presence of DMF. The pNB protecting group is often used
during the large-scale synthesis of cephalosporin-type
antibiotics. Its selective removal presents problems,
however, particularly for recovery and disposal of the zinc
catalyst and the large amounts of organic solvents used.
Therefore, a major pharmaceutical company devoted
significant effort some years ago to finding an enzyme that
would perform this selective hydrolysis reaction (Brannon
et al., 1976; Zock et al., 1994). An enzyme with some
activity towards pNB ester hydrolysis was identified by
screening a large number of microorganisms, but the
enzyme's low activity, especially in the solvents required
to solubilize these materials, made it a poor competitor to
the classical chemical catalyst.

We were challenged in 1994 to evolve a pNB esterase with
much higher activity, particularly in the presence of the
polar organic solvents required to achieve high substrate
solubility. We had two reasons to believe that this could
be done. First, the enzyme's natural function and,
therefore, natural substrates are unknown, but they are
unlikely to be the antibiotic pNB esters. Second, the
natural enzyme's activity is very sensitive to organic
solvents. Because these features were never required in the
enzyme's natural setting, we could expect considerable
improvement through directed evolution.

The wild-type esterase is not secreted by the E. coli cells
in which it is made, nor does it carry out a reaction that
is easily measured. Thus, we had to develop screening
strategies more sophisticated than those used for the
subtilisin. The p-nitrobenzyl ester hydrolysis reaction is
assayed laboriously by high performance liquid
chromatography, a method unsuitable for screening tens of
thousands of colonies. We therefore devised a rapid
screening assay using a similar, but not identical,
p-nitrophenyl ester substrate, in order to have an
easy-to-read colorimetric signal. The screening reactions
could then be carried out in the 96 wells of a plastic
microtiter plate, using an automatic spectrophotometer to
read and analyse the absorbance in all 96 wells at once.

Using this rapid assay to screen about a thousand
colonies per generation, we completed several sequential
cycles of random PCR mutagenesis and screening,
as illustrated in Fig. 7 (Moore and Arnold, 1996).
After four generations, the enzyme's specific activity in
15% DMF had improved 15-fold. In the fourth
generation, we collected not one, but 64 different
clones, some of which were better than the parent,
and many of which were not. The purpose for this was
two-fold. First, we wanted to make sure that our
screening strategy was working properly to give us an
enzyme that would catalyse the desired p-nitrobenzyl
hydrolysis reaction, not only the colorimetric
p-nitrophenyl screening reaction. The activities of each of
the 64 clones in both reactions are compared in Fig. 8.

Fig. 7. Directed evolution of pNB esterase in 15% DMF
involved four generations of random mutagenesis and
screening, followed by one round of recombination of the
five best genes from generation 4. The best variant obtained
after four generations is 15-fold more active than wild-type.
The best variant from screening 400 colonies of the
recombination pool is ~30-fold more active than wild-type.

[Fig. 8 axis label: Rate of hydrolysis of screening substrate
(relative to 3rd generation parent)]

Fig. 8. Comparison of activities on target (p-nitrobenzyl)
and screening (p-nitrophenyl) substrate of 64 pNB esterase
variants isolated after fourth generation of random
mutagenesis and screening, relative to parent enzyme from the
third generation. The five most active variants (inside oval)
were pooled for random recombination (see Fig. 9).

If the screening reaction perfectly mimicked the
desired reaction, all the points would lie on the
45° line. Although somewhat scattered, there is
nonetheless a reasonable correlation: the rapid screen
provides an indication of evolution of the desired
activity that is acceptable for making a rough cut of
positive clones.
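
As a minimal sketch of how the agreement between the two assays could be quantified, the snippet below computes a Pearson correlation coefficient between paired screening and target activities and shortlists the clones with the highest screening signal. The activity values, the function names, and the choice of the Pearson coefficient are all illustrative assumptions; they are not the data of Fig. 8 or a method described here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def top_k(values, k):
    """Indices of the k largest values, best first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return order[:k]


# Hypothetical activities relative to the parent (parent = 1.0).
# These numbers are invented for illustration only.
screen = [0.5, 0.8, 1.0, 1.3, 1.9, 2.4]  # p-nitrophenyl (screening) substrate
target = [0.6, 0.7, 1.1, 1.2, 1.6, 2.5]  # p-nitrobenzyl (target) substrate

r = pearson(screen, target)
best = top_k(screen, 2)
print(f"correlation r = {r:.2f}; shortlisted clone indices: {best}")
```

A coefficient close to 1 would correspond to points lying near a straight line; the shortlisted clones would still need to be confirmed on the target substrate.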

The second reason for studying this group of
variants was to test the alternate, sexual approach for
accumulating effective mutations. We thus collected
the five best mutants, those in the dotted oval in
Fig. 8, and recombined them using a "sexual" PCR
method recently described by Stemmer (1994a, b).
How the genes are randomly recombined is shown
schematically in Fig. 9(a). The genes are pooled in the
test tube and fragmented with an enzyme that cuts the
DNA at random positions. In Fig. 9(b), the
polyacrylamide electrophoresis gel that separates the DNA
fragments by length shows that the DNA has been
digested into a smear of different-sized pieces. We
collected the fragments 200-300 base pairs in length
by extracting the DNA from the appropriate piece of
gel. The full-length gene can be reassembled from this
pool of random fragments, again using the PCR
technology, to create a new gene library in which the
mutations were present in their different possible
combinations. These reassembled, recombined genes
were inserted back into the plasmid and expressed in
the E. coli. The best of those recombined genes were
identified, as before, by screening the enzymes they
code for and produce in the microorganisms.
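
The way recombination generates new combinations of mutations can be illustrated with a toy simulation. The sketch below is a deliberate simplification and rests on several assumptions that are not part of the method described above: genes are short strings, fragments have a fixed length and fixed boundaries (real digestion is random), and each fragment of a child gene is copied from a randomly chosen parent. All names and sequences are hypothetical.

```python
import random


def shuffle_genes(parents, fragment_len, rng):
    """Build one child gene, taking each consecutive fragment
    from a randomly chosen parent (toy model of reassembly)."""
    length = len(parents[0])
    if any(len(p) != length for p in parents):
        raise ValueError("all parent genes must have the same length")
    pieces = []
    for start in range(0, length, fragment_len):
        donor = rng.choice(parents)
        pieces.append(donor[start:start + fragment_len])
    return "".join(pieces)


def make_library(parents, size, fragment_len, seed=0):
    """Generate a library of recombined genes with a fixed random seed."""
    rng = random.Random(seed)
    return [shuffle_genes(parents, fragment_len, rng) for _ in range(size)]


# Hypothetical parents: the "wild-type" is all lowercase, and each parent
# carries a single uppercase "mutation" in a different fragment.
wild_type = "a" * 20
parents = [wild_type[:i] + "A" + wild_type[i + 1:] for i in (2, 7, 12, 17)]

library = make_library(parents, size=400, fragment_len=5)
counts = [sum(ch.isupper() for ch in gene) for gene in library]
print("most mutations combined in one gene:", max(counts))
```

In this model, children can carry anywhere from none to all of the parental mutations, which is the property that makes the recombined library worth screening.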

Screening only ~400 colonies yielded eight clones
with activity significantly greater than the best of the
five parents; this yield of positives is at least 20-fold
higher than we found by screening the genes with
point mutations alone (typically 1/1000).
Recombination can enhance directed evolution by making use of
the information present in a population of improved
enzymes produced by mutagenesis and screening,
information that would otherwise be discarded. Thus
far, we have improved the enzyme's specific activity
towards the antibiotic substrate more than 30-fold in
15% DMF. The total expressed activity is at least
50-fold greater than the original system we started with.
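
The hit-rate comparison can be checked with simple arithmetic. The sketch below treats the colony count as exactly 400, which is an assumption because the text gives it only approximately; the function names are illustrative.

```python
from fractions import Fraction


def hit_rate(positives, screened):
    """Fraction of screened colonies that scored as positives."""
    return Fraction(positives, screened)


def fold_enrichment(rate, baseline):
    """How many times larger one hit rate is than another."""
    return rate / baseline


recombination_rate = hit_rate(8, 400)        # eight positives in 400 colonies
implied_baseline = recombination_rate / 20   # a rate 20-fold lower

print("recombination hit rate:", recombination_rate)
print("baseline implied by a 20-fold difference:", implied_baseline)
print("fold enrichment:", fold_enrichment(recombination_rate, implied_baseline))
```

Eight positives in 400 colonies is a 2% hit rate, so a 20-fold lower rate for point mutations alone corresponds to about one positive per thousand colonies.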

Sequencing of the genes coding for improved
enzymes once again allowed us to identify the amino
acid substitutions responsible for the observed
improvements in catalytic performance. Six effective
mutations are illustrated in Fig. 10, on a model of the
pNB esterase developed from the X-ray crystal
structure of a homologous enzyme (Moore and Arnold,
1996). As for the case of subtilisin, most of the
mutations are at or near the solvent-accessible surface.
Only one of the six is deeply buried. In contrast to
subtilisin, however, none of the effective amino acid
substitutions lie in segments of the esterase predicted
to interact directly with the bound substrate. It is
possible that the homology modeling yielded an
incorrect structure, and the mutations do interact with
the p-nitrobenzyl substrate. Or, it may be that the
amino acid substitutions sampled at positions
adjacent to the substrate were all deleterious, and small
improvements were only obtained by altering amino
acids further away. In any case, the mechanism(s) by
which these amino acid substitutions enhance the
catalytic activity of the evolved pNB esterases are subtle
and would have been very difficult to predict in advance.

CONCLUSIONS 

The directed evolution approach clearly allows us
to engineer enzymes with novel functions and
features. In contrast to 'rational' design approaches,
directed evolution can be applied even when very little
is known about an enzyme's structure or catalytic
mechanism. Since the vast majority of proteins remain
largely uncharacterized, this marks a huge advantage
for the evolutionary methods. This approach, because
it allows us to explore novel solutions to protein
design problems, also promises to teach us a great
deal about protein structure and function.

Future research in directed evolution will include
development of large-scale screening methods, so that
efficient searches of large mutant libraries can be
performed. The construction of optimized mutant libraries
will also decrease the need for screening. In addition to
streamlining efforts to 'tune' enzymes, these
improvements will allow larger leaps, such as the evolution of
new catalytic activities, to take place. Significant
improvements in the ease and power of directed evolution
will also come from optimizing the search strategies.
The many similarities to optimization problems in
other fields make this a fertile ground for collaborative
efforts among theoreticians and experimentalists from
a wide range of engineering disciplines.

Fig. 9. Recombination of mutations by gene shuffling. (a)
'Sexual' PCR method (Stemmer, 1994a,b) involves random
digestion of the gene pool using DNase enzyme, followed by
gene reassembly using PCR. Reassembled genes contain the
different combinations of mutations.

Fig. 9. (b) Top polyacrylamide electrophoresis gel shows the
separation of the digested gene fragments by size. Fragments
200-300 base pairs long were recovered by extraction from the
excised gel segment. These were reassembled into the full-length
gene (bottom gel lanes 3-5 and 6-8). First lane on left is a
'ladder' of DNA of known molecular weights.